Log command line options, hyperparameters, and weights per run in runs/
#104
Conversation
Uncomment plot_lr_scheduler in train() and pass log_dir as save location
added logic in train.py __main__ to handle resuming from a run
Add example of hyp yaml
Update: I've made some bug fixes, and can verify that saving with the above dir structure, loading a new hyp file, and automatically resuming from the most recent last.pt all work. I still haven't written tests for these things, so I can't say it's been rigorously tested, but it works.
@alexstoken thanks for the PR! This area is definitely in need of some organization, and we've had issues specifically related to this as well (#20). Moving the hyps to their own yaml file was also on my TODO list, so it looks like you've beaten me to it. This should make it much easier to organize trainings for different datasets (the current hyps are mainly tuned for COCO). I'll try to look over this this week.
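For concreteness, a per-dataset hyp file could look something like the sketch below. The key names mirror the `hyp` dict in train.py, but the file name and values here are illustrative placeholders, not the repo's tuned defaults:

```yaml
# hyp_custom.yaml -- illustrative example of a per-dataset hyperparameter file
# (key names follow train.py's hyp dict; values are placeholders, not tuned defaults)
lr0: 0.01          # initial learning rate
momentum: 0.937    # SGD momentum
weight_decay: 0.0005
giou: 0.05         # GIoU loss gain
cls: 0.58          # cls loss gain
obj: 1.0           # obj loss gain
hsv_h: 0.014       # HSV-Hue augmentation
hsv_s: 0.68        # HSV-Saturation augmentation
hsv_v: 0.36        # HSV-Value augmentation
degrees: 0.0       # image rotation (deg)
translate: 0.0
scale: 0.5
shear: 0.0
```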
@alexstoken it is also true that the total training settings (i.e. what you would need to completely reproduce a training run) are currently sourced from an awkward combination of the model.yaml, dataset.yaml, hyps, and argparser arguments, which makes things a bit confusing of course. One other quick point is that there are several PNGs generated during training. They are:
The default behavior is to generate these 5 once and be done; however, if the user deletes any during training, they will be regenerated at the next opportune moment, so you can see the latest test results. Generating the images is slow (about 1 s per image), which is why I did not default to creating them every epoch. The most important point here is that all of these images were created for better introspection capability, but they are now a bit disorganized and could perhaps use a better system, i.e. a dashboard or better integration with tensorboard for better visibility/usability/discoverability.
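On the TensorBoard point, one lightweight option is to log these diagnostic images with `add_image` so they show up alongside the scalar metrics. This is a hypothetical sketch (the paths, tag, and writer setup are illustrative, not how train.py currently handles it):

```python
import numpy as np
from PIL import Image
from torch.utils.tensorboard import SummaryWriter

# Hypothetical example: push an existing diagnostic image into TensorBoard
# instead of leaving it only as a loose PNG in the run directory.
tb_writer = SummaryWriter(log_dir='runs/exp0')
img = np.asarray(Image.open('runs/exp0/train_batch1.jpg'))   # HWC, RGB uint8
tb_writer.add_image('mosaics/train_batch1', img, global_step=0, dataformats='HWC')
tb_writer.close()
```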
@glenn-jocher Ah, I didn't realize that issue was related because of the title, but they definitely have overlap here. Agreed on the ease of multiple hyp files for different datasets; I think it's ok to leave the included hyps in as the defaults. I noticed and adjusted some of those plots/images as well, and I'll add some commits changing the save location of the rest. I'm tentatively using the `log_dir` argument for the save location. I like the idea of moving those to tensorboard or another dashboard. You've already done a great job making all of the other losses/metrics available there, so adding these there as well makes sense.
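A minimal sketch of how a `--hyp` argument could update the default dictionary, with placeholder keys and values rather than the PR's exact code:

```python
import argparse

import yaml

# Illustrative defaults only -- not the values hard-coded in train.py
hyp = {'lr0': 0.01, 'momentum': 0.937, 'weight_decay': 0.0005}

parser = argparse.ArgumentParser()
parser.add_argument('--hyp', type=str, default='', help='optional hyperparameter yaml path')
opt = parser.parse_args()

if opt.hyp:
    with open(opt.hyp) as f:
        hyp.update(yaml.safe_load(f))  # user-supplied keys override the defaults
print(hyp)
```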
@alexstoken was reviewing the PR. Two things here:
EDIT: to clarify, I think I'm going to remove all existing --resume functionality as well, not just the additions you proposed.
Hi @glenn-jocher ,
Additional changes to consider (included in PR, but can be removed):
Potential changes for this PR or a follow-up
@alexstoken I've made a few updates and merged just now. Hopefully I didn't break anything with my changes. I'll do some checks now.
@alexstoken hey I just thought of something. We have tensorboard tightly integrated here, but do you think there are use cases where people would prefer to train without it? Is it a requirement for PyTorch Lightning training?
Thanks for the contribution. There may be a mistake in logging the parameters in `hyp`, since some of them are modified after loading. For example, we currently log `hyp['cls'] *= nc / 80` rather than the original `hyp['cls']`. I would suggest that we move the following lines to line 54 in train.py:

```python
# Save run settings
with open(Path(log_dir) / 'hyp.yaml', 'w') as f:
    yaml.dump(hyp, f, sort_keys=False)
with open(Path(log_dir) / 'opt.yaml', 'w') as f:
    yaml.dump(vars(opt), f, sort_keys=False)
```
@jundengdeng yes you are correct! Would you like to submit a PR?
@glenn-jocher Glad to see it's all working well. Sorry to not have picked up on your formatting when I went through it the first time around; it probably would have been an easier merge if I had. @jundengdeng good catch! I had it down there specifically to log the updated `hyp` values. I had done that on purpose because it shows the user what the model actually ran with, but it does create a bug with `--resume`.
@alexstoken @jundengdeng we want to log the settings to reproduce a run, so we should ignore the operations that those settings undergo once inside a run, as it is the responsibility of train.py to resume properly given the same initial settings and an epoch number. This particular operation scales class loss by class count, otherwise the loss mean operation will have negative effects for smaller class count datasets, as the cls hyp was set for COCO at 80 classes. No worries about the formatting, I should probably make a contribution guide or similar. One of the simplest things you can do is to pass your code through a PEP8 formatter prior to submission. PyCharm makes this very easy, for example with Code > Reformat Code.
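As a concrete illustration of that scaling (the base gain value below is a placeholder, not necessarily the repo default):

```python
# Placeholder base gain; the actual default lives in train.py's hyp dict
hyp = {'cls': 0.58}   # cls loss gain tuned against COCO's 80 classes
nc = 20               # e.g. a 20-class custom dataset
hyp_cls_scaled = hyp['cls'] * nc / 80
print(hyp_cls_scaled)  # 0.145 -- the run uses a proportionally smaller cls gain
```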
@glenn-jocher Makes sense. Do you want a PR to fix the bug or just commit the change yourself? I can do the PR real quick if that's your preference. That way users won't have a bug in resume for long.
@alexstoken sure, can you submit a PR?
I've updated the log_dir naming defaults now to make it a bit cleaner in 603ea0b. New runs will be saved to runs/exp1, runs/exp2, etc., with an optional --name, i.e. runs/exp3_name.
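A minimal sketch of how such an incrementing scheme could work; the helper name `increment_dir` and its details are illustrative, not necessarily the implementation in 603ea0b:

```python
import glob
import re
from pathlib import Path

def increment_dir(base='runs/exp', name=''):
    # Return the next free runs/expN[_name] path (illustrative helper, not the repo's exact code)
    nums = []
    for d in glob.glob(f'{base}*'):
        m = re.search(r'exp(\d+)', Path(d).name)
        if m:
            nums.append(int(m.group(1)))
    n = max(nums) + 1 if nums else 0
    return f'{base}{n}' + (f'_{name}' if name else '')

print(increment_dir(name='tutorial'))  # e.g. 'runs/exp0_tutorial' on a fresh workspace
```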
@alexstoken @jundengdeng do you guys have any thoughts on integrating hyps dictionaries into data.yaml files? I've been considering how to extract the hyps from train.py, since I think they should absolutely be separated, but I'm not sure whether to place the hyps in their own yaml file, as in yolov5/hyps/hyps_coco.yaml etc., or to simply place them directly into the coco.yaml file. I suppose hyps may vary optimally between models, i.e. yolov5s and yolov5x are very different sizes, so it's possible each may have optimal and different hyps from the other, but the main differentiator is the dataset in my mind, so it would be reasonable to link a hyps dictionary to a dataset. What do you think?
Thanks for asking.

```python
_base_ = './yolov5x_coco.py'

# new learning policy
lr0 = 0.0001
```
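The snippet above resembles MMDetection-style config inheritance. A minimal sketch of how the same idea could be applied to the YAML hyp files discussed in this PR; the `_base_` key, file names, and helper are illustrative, not existing repo behavior:

```python
import yaml

def load_hyp(path):
    # Resolve an optional '_base_' key by loading the parent file first,
    # then letting the child's keys override it (illustrative sketch only).
    with open(path) as f:
        cfg = yaml.safe_load(f) or {}
    base_path = cfg.pop('_base_', None)
    if base_path:
        base = load_hyp(base_path)
        base.update(cfg)
        cfg = base
    return cfg

# e.g. a hyp_finetune.yaml containing "_base_: hyp_coco.yaml" and "lr0: 0.0001"
# hyp = load_hyp('hyp_finetune.yaml')
```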
@glenn-jocher Nice change. I wasn't sure of the best way to implement the incrementer, but that works well. Much easier to look at/use than the tensorboard default.

#345: I'm realizing this PR didn't take into account any of the tutorials that were based on the previous logging structure. It seems this only comes into play in the Visualize section, because the image locations are hard-coded. A similar solution would be to change the paths to:

```python
# Old
Image(filename='./train_batch1.jpg', width=900)  # view augmented training mosaics
# New
Image(filename='./runs/exp0_tutorial/train_batch1.jpg', width=900)  # view augmented training mosaics
```

It seems like the bug in #345 is similar, and Roboflow should update their tutorial to match the evolving repo (you did mention things were subject to change!)
I think they should remain separate from the data. Only at the end, after many experiments are run, can an optimal hyp set be determined for a dataset and model. So for the provided benchmarks you can provide some hyp files, but when a user goes to experiment with their own custom data, tying the hyps and the data together could be problematic, since then the data file isn't static and it'll be more confusing to monitor which changes in hyp improved performance.

I think the main modular elements of each training run are the data, the model architecture, the learning/optimization scheme, and the program arguments (similar to the best practices described in PyTorch Lightning - they have 3, plus data). Right now those are distributed across 4 files (data, model, hyp, opt), which I think pretty closely follows best practice. The only change would be to allow a user to pass
@alexstoken yes, the Visualize section of all the tutorials is broken now and needs fixing. You can't make an omelette without cracking a few eggs. OK, let's just sleep on this for a few days while I patch up things elsewhere. I think the largest updates have already been implemented and the results seem to be working well, so I feel we've already taken a big step in the right direction!
@alexstoken what do you think about single-command resume? For example, your VM shuts down, you restart it, and continue with just a single resume command, without having to worry about the exact settings you used to start the run initially. Do we have everything we need saved in opt.yaml etc. to do this currently? We would also need to make sure the process is repeatable, so that --resuming once doesn't hurt --resuming a second time.
The reason I'm thinking of this, BTW, is that 90% of the misunderstandings and errors people run into are typically human error. For example, I might try to --resume with a different batch size or model, or forget that I used --single-cls when starting the training, and then not understand why I get an error resuming later without it; or, just as bad, I get no error but discontinuous results and can't understand why. The less 'manual labor' that is required, the fewer opportunities people have to introduce errors.
@glenn-jocher great idea, I think that's much more user friendly and will save a ton of headaches. All of the components are in place, for the most part. For example:

```python
import argparse, os, glob, yaml
from argparse import Namespace


def get_latest_run(search_dir='./runs'):
    # Return the most recently modified run dir in /runs (its opt.yaml and weights/last.pt are used to --resume)
    last_list = glob.glob(f'{search_dir}/*', recursive=False)
    return max(last_list, key=os.path.getctime)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=300)
    parser.add_argument('--resume', nargs='?', const='get_last', default=False, help='resume training from last.pt')
    parser.add_argument('--weights', type=str, default='', help='initial weights path')
    opt = parser.parse_args()

    if opt.resume:
        last_run_dir = get_latest_run()
        with open(last_run_dir + os.sep + 'opt.yaml', 'r') as f:
            resume_opt = Namespace(**yaml.load(f, Loader=yaml.FullLoader))
        resume_opt.weights = last_run_dir + os.sep + 'weights' + os.sep + 'last.pt'
        print(f'Resuming training from {last_run_dir}')
        opt = resume_opt
    print(opt)
```

Results in (old runs from before the exp# naming update):
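One way to reason about glenn-jocher's repeatability concern: rebuilding opt from the saved opt.yaml should give identical settings no matter how many times it is done, with --weights always re-pointed at the run's newest last.pt. The check below is a hypothetical, self-contained sketch (the fake opt.yaml contents and helper name are illustrative, not the PR's actual code):

```python
import os
import tempfile
from argparse import Namespace

import yaml

# Write a fake opt.yaml into a throwaway run dir (contents and paths are illustrative)
run_dir = tempfile.mkdtemp(prefix='exp0_')
os.makedirs(os.path.join(run_dir, 'weights'), exist_ok=True)
with open(os.path.join(run_dir, 'opt.yaml'), 'w') as f:
    yaml.dump({'epochs': 300, 'weights': '', 'resume': False}, f, sort_keys=False)

def load_resume_opt(run_dir):
    # Rebuild opt from the saved yaml and point weights at the run's last.pt
    with open(os.path.join(run_dir, 'opt.yaml')) as f:
        opt = Namespace(**yaml.safe_load(f))
    opt.weights = os.path.join(run_dir, 'weights', 'last.pt')
    return opt

first = load_resume_opt(run_dir)
second = load_resume_opt(run_dir)   # a second --resume should see identical settings
assert vars(first) == vars(second)
print(first)
```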
Hello,
This PR adds new components to, and changes the structure of, logging to better support multiple experiments. It was inspired by some of the logging done in PyTorch Lightning. I am opening this PR as a starting place to discuss these changes, and am happy to modify it once best logging practices are agreed upon.
Additions

- Log `opt` to the same directory as the tensorboard log
- Log `hyp` to the same directory as the tensorboard log
- Add `--hyp` argument to pass a YAML file of hyperparameters. Update the default `hyp` dictionary with these parameters.

Changes

- `last.pt` and `best.pt` are saved to the same directory as the tensorboard log. Thus, each experimental run will have a `last.pt` and a `best.pt`. This takes up more space, so perhaps this should be optional, or some logic to delete old runs can be implemented.
- `get_latest_run()` finds the most recently changed `*/last.pt` amid all of the files in `/runs/*` to use as the default resume checkpoint, since weight files are no longer stored in the root dir (a possible implementation is sketched just after this list). Another option would be to store `last.pt` in the root dir until training is complete, then move it to `/runs/*/weights`.
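A minimal sketch of what such a search could look like; this is a hypothetical variant written against the description above, not necessarily the PR's exact helper:

```python
import glob
import os

def get_latest_run(search_dir='./runs'):
    # Return the most recently modified '*/last.pt' under runs/ (hypothetical variant)
    last_list = glob.glob(f'{search_dir}/**/last.pt', recursive=True)
    return max(last_list, key=os.path.getctime) if last_list else ''
```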
Proposed final dir structure for logging

runs/
- CURRENT_TIME[--name]/
  - hyp.yaml
  - opt.yaml
  - events.out.file
  - LR.png
  - weights/
    * last.pt
    * best.pt
- CURRENT_TIME[--name]/
  - hyp.yaml
  - opt.yaml
  - events.out.file
  - LR.png
  - weights/
    * last.pt
    * best.pt

Notes

- Each run's log directory is named `CURRENT_TIME[--name]`.
🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Enhancements in saving directories and hyperparameter configurations for YOLOv5 training and testing.

📊 Key Changes

- New `save_dir` parameter in `test.py` and `train.py` to specify save directories.
- `glob.glob` paths now use `Path` objects for improved OS compatibility.
- Hyperparameter (`hyp`) dict in `train.py` now includes the optimizer (`'SGD'`, or `'adam'` if specified).
- `get_latest_run` fetches the path to the latest run for resuming training.
- Run settings (`hyp.yaml` and `opt.yaml`) saved in the run directory.

🎯 Purpose & Impact

- Logged settings now capture the choice between `SGD` and `Adam` optimizers.
- The `get_latest_run` function enables users to seamlessly resume training from the most recent checkpoint without manually specifying it.
- Using `Path` objects for file paths ensures compatibility across different operating systems, reducing potential path-related issues.