Resume with single arg #457

Closed
wants to merge 22 commits into from

alexstoken
Contributor

@alexstoken alexstoken commented Jul 20, 2020

Based on conversation regarding --resume functionality in #104

This PR aims to reduce confusion and user error (exceptions, and undetected errors which cause poor training performance) when using the resume training functionality. This PR restricts users to only using --resume as intended.

There are two use cases:

  1. Search for the most recent runs/exp* directory, and resume training from there.
     python train.py --resume
  2. Resume from a specific unfinished training run. Useful when multiple training runs have been interrupted.
     python train.py --resume runs/exp*
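
The two invocations above can be sketched with a single optional-value argument. This is a minimal illustration, not the code from this PR: the helper names `build_parser`, `latest_run_dir`, and `resolve_resume_dir` are hypothetical, and "most recent" is assumed here to mean the latest modification time.

```python
import argparse
import glob
import os


def build_parser():
    parser = argparse.ArgumentParser()
    # nargs="?" lets --resume appear with or without a value:
    #   absent              -> False
    #   --resume            -> True (search for the most recent run)
    #   --resume runs/exp3  -> "runs/exp3"
    parser.add_argument("--resume", nargs="?", const=True, default=False)
    return parser


def latest_run_dir(root="runs"):
    """Return the most recently modified exp* directory under root, or None."""
    # Assumption: recency is judged by directory modification time.
    candidates = [d for d in glob.glob(os.path.join(root, "exp*")) if os.path.isdir(d)]
    return max(candidates, key=os.path.getmtime) if candidates else None


def resolve_resume_dir(resume, root="runs"):
    """Map the parsed --resume value to a run directory (None if not resuming)."""
    if resume is False:
        return None
    return latest_run_dir(root) if resume is True else resume
```

With this shape, `build_parser().parse_args(["--resume"]).resume` is `True`, and passing a path yields that path as a string, so one argument covers both use cases.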

Other additions:

  • Adds a check that warns users when they attempt to resume an already completed run.
  • Checks the resume directory for the three files needed to resume: opt.yaml, hyp.yaml, and last*.pt.
  • Ignores all other args if --resume is used. This ensures users do not accidentally interfere with their training scheme.
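
The required-files check could look roughly like the sketch below. It assumes all three entries sit directly inside the run directory, which may not match the actual layout used by train.py; `missing_resume_files` is a hypothetical helper name.

```python
import glob
import os

# Fixed file names listed in the PR description; the checkpoint is matched by pattern.
REQUIRED_FILES = ("opt.yaml", "hyp.yaml")
CHECKPOINT_PATTERN = "last*.pt"


def missing_resume_files(run_dir):
    """Return the required entries that are absent from run_dir (empty list if none)."""
    missing = [
        name for name in REQUIRED_FILES
        if not os.path.isfile(os.path.join(run_dir, name))
    ]
    # At least one checkpoint matching last*.pt must exist.
    if not glob.glob(os.path.join(run_dir, CHECKPOINT_PATTERN)):
        missing.append(CHECKPOINT_PATTERN)
    return missing
```

A caller could refuse to resume, and name the missing files in the error message, whenever the returned list is non-empty.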

Here is a Colab notebook with some examples.

Happy to adjust implementation further based on discussion.

🛠️ PR Summary

🌟 Summary

Enhanced --resume behavior with checks for completed runs and improved configuration restoration.

📊 Key Changes

  • 🚫 If an attempt is made to resume a finished training run, the run's directory is removed and a warning is issued.
  • ✏️ The help text for the --resume argument now specifies that it should point to an experiment directory, not a .pt file.
  • 🔄 The resumption method in train.py was reworked to:
    • Ensure the provided --resume path points to a valid directory containing necessary files (e.g., opt.yaml, hyp.yaml, and weights).
    • Load and apply options (opt) and hyperparameters (hyp) from the original run.
    • Set the path to the weights file from the last checkpoint of the original run.
  • 📁 The updated behavior creates a better link between the resumed run and its parent, eliminating confusion and potential mistakes when resuming training.
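
As a rough sketch of the restoration step described above, the function below rebuilds the options for a resumed run from the saved values and points the weights at the last checkpoint. It assumes opt.yaml has already been parsed into a dict; `restore_options` and the option names used in the usage note (`epochs`, `batch_size`) are illustrative, not taken from the PR.

```python
import argparse


def restore_options(cli_opt, saved_opt, weights_path):
    """Build the options for a resumed run.

    Every option comes from saved_opt (the parsed opt.yaml of the original
    run); command-line values other than --resume are deliberately dropped.
    """
    restored = argparse.Namespace(**saved_opt)
    restored.resume = cli_opt.resume   # keep the resume flag or path as given
    restored.weights = weights_path    # continue from the last checkpoint
    return restored
```

For example, if the command line carried `epochs=5` but the saved options say `epochs=300`, the restored namespace reports 300, which matches the stated intent of ignoring all other arguments when --resume is used.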

🎯 Purpose & Impact

  • The update prevents users from inadvertently trying to resume a run that has already concluded, which could lead to data loss.
  • By automatically restoring the configuration of the previous run, this PR simplifies resuming and makes it less error-prone.
  • Users get a more robust and intuitive resume workflow, which saves time when training runs are interrupted.

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the Stale label Aug 20, 2020
@alexstoken
Contributor Author

Solved by #756.

@alexstoken alexstoken closed this Aug 20, 2020