Log command line options, hyperparameters, and weights per run in runs/
#104
Conversation
Uncomment plot_lr_scheduler in train() and pass log_dir as save location
added logic in train.py __main__ to handle resuming from a run
Add example of hyp yaml
Update: I've made some bug fixes, and can verify that saving with the above dir structure, loading a new hyp file, and automatically resuming from the most recent last.pt all work. I still haven't written tests for these things, so I can't say it's been rigorously tested, but it works.
@alexstoken thanks for the PR! This area is definitely in need of some organization, and we've had issues specifically related to this as well (#20). Moving the hyps to their own yaml file was also on my TODO list, so it looks like you've beaten me to it. This should make it much easier to organize trainings for different datasets (the current hyps are mainly tuned for COCO). I'll try to look over this this week.
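For concreteness, a per-dataset hyp file could look something like the sketch below. The key names mirror the `hyp` dict in train.py, but the file name and values here are illustrative placeholders, not the repo's tuned defaults:

```yaml
# hyp_custom.yaml -- illustrative example of a per-dataset hyperparameter file
# (key names follow train.py's hyp dict; values are placeholders, not tuned defaults)
lr0: 0.01          # initial learning rate
momentum: 0.937    # SGD momentum
weight_decay: 0.0005
giou: 0.05         # GIoU loss gain
cls: 0.58          # cls loss gain
obj: 1.0           # obj loss gain
hsv_h: 0.014       # HSV-Hue augmentation
hsv_s: 0.68        # HSV-Saturation augmentation
hsv_v: 0.36        # HSV-Value augmentation
degrees: 0.0       # image rotation (deg)
translate: 0.0
scale: 0.5
shear: 0.0
```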
@alexstoken it is also true that the total training settings (i.e. what you would need to completely reproduce a training run) are currently sourced from an awkward combination of the model.yaml, dataset.yaml, hyps, and argparser arguments, which makes things a bit confusing of course. One other quick point is that there are several PNGs generated during training. They are:
The default behavior is to generate these 5 once and be done; however, if the user deletes any during training, they will be regenerated at the next opportune moment, so you can see the latest test results. Generating the images is slow (about 1 s per image), which is why I did not default to creating them every epoch. The most important point here is that all of these images were created for better introspection capability, but they are now a bit disorganized and could perhaps use a better system, i.e. a dashboard or better integration with tensorboard for better visibility/usability/discoverability.
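On the TensorBoard point, one lightweight option is to log these diagnostic images with `add_image` so they show up alongside the scalar metrics. This is a hypothetical sketch (the paths, tag, and writer setup are illustrative, not how train.py currently handles it):

```python
import numpy as np
from PIL import Image
from torch.utils.tensorboard import SummaryWriter

# Hypothetical example: push an existing diagnostic image into TensorBoard
# instead of leaving it only as a loose PNG in the run directory.
tb_writer = SummaryWriter(log_dir='runs/exp0')
img = np.asarray(Image.open('runs/exp0/train_batch1.jpg'))   # HWC, RGB uint8
tb_writer.add_image('mosaics/train_batch1', img, global_step=0, dataformats='HWC')
tb_writer.close()
```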
@glenn-jocher Ah, I didn't realize that issue was related because of the title, but they definitely have overlap here. Agreed on the ease of multiple hyp files for different datasets; I think it's ok to leave the included hyps in as the defaults. I noticed and adjusted some of those plots/images as well, and I'll add some commits changing the save location of the rest. I'm tentatively using the `log_dir` argument for the save location. I like the idea of moving those to tensorboard or another dashboard. You've already done a great job making all of the other losses/metrics available there, so adding these there as well makes sense.
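A minimal sketch of how a `--hyp` argument could update the default dictionary, with placeholder keys and values rather than the PR's exact code:

```python
import argparse

import yaml

# Illustrative defaults only -- not the values hard-coded in train.py
hyp = {'lr0': 0.01, 'momentum': 0.937, 'weight_decay': 0.0005}

parser = argparse.ArgumentParser()
parser.add_argument('--hyp', type=str, default='', help='optional hyperparameter yaml path')
opt = parser.parse_args()

if opt.hyp:
    with open(opt.hyp) as f:
        hyp.update(yaml.safe_load(f))  # user-supplied keys override the defaults
print(hyp)
```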
@alexstoken was reviewing the PR. Two things here:
EDIT: to clarify, I think I'm going to remove all existing --resume functionality as well, not just the additions you proposed.
Hi @glenn-jocher ,
Additional changes to consider (included in PR, but can be removed):
Potential changes for this PR or a follow-up
@alexstoken I've made a few updates and merged just now. Hopefully I didn't break anything with my changes. I'll do some checks now.
@alexstoken hey I just thought of something. We have tensorboard tightly integrated here, but do you think there are use cases where people would prefer to train without it? Is it a requirement for PyTorch Lightning training?
Thanks for the contribution. There may be a mistake in logging the parameters in `hyp`, since some of them are modified after loading. For example, we currently log `hyp['cls'] *= nc / 80` rather than the original `hyp['cls']`. I would suggest that we move the following lines to line 54 in train.py:

```python
# Save run settings
with open(Path(log_dir) / 'hyp.yaml', 'w') as f:
    yaml.dump(hyp, f, sort_keys=False)
with open(Path(log_dir) / 'opt.yaml', 'w') as f:
    yaml.dump(vars(opt), f, sort_keys=False)
```
@jundengdeng yes you are correct! Would you like to submit a PR?
@glenn-jocher Glad to see it's all working well. Sorry to not have picked up on your formatting when I went through it the first time around; it probably would have been an easier merge if I had. @jundengdeng good catch! I had it down there specifically to log the updated `hyp` values. I had done that on purpose because it shows the user what the model actually ran with, but it does create a bug with `--resume`.
@alexstoken @jundengdeng we want to log the settings to reproduce a run, so we should ignore the operations that those settings undergo once inside a run, as it is the responsibility of train.py to resume properly given the same initial settings and an epoch number. This particular operation scales class loss by class count, otherwise the loss mean operation will have negative effects for smaller class count datasets, as the cls hyp was set for COCO at 80 classes. No worries about the formatting, I should probably make a contribution guide or similar. One of the simplest things you can do is to pass your code through a PEP8 formatter prior to submission. PyCharm makes this very easy, for example with Code > Reformat Code.
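As a concrete illustration of that scaling (the base gain value below is a placeholder, not necessarily the repo default):

```python
# Placeholder base gain; the actual default lives in train.py's hyp dict
hyp = {'cls': 0.58}   # cls loss gain tuned against COCO's 80 classes
nc = 20               # e.g. a 20-class custom dataset
hyp_cls_scaled = hyp['cls'] * nc / 80
print(hyp_cls_scaled)  # 0.145 -- the run uses a proportionally smaller cls gain
```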
@glenn-jocher Makes sense. Do you want a PR to fix the bug or just commit the change yourself? I can do the PR real quick if that's your preference. That way users won't have a bug in resume for long.
@alexstoken sure, can you submit a PR?
I've updated the log_dir naming defaults now to make it a bit cleaner in 603ea0b. New runs will be saved to runs/exp1, runs/exp2, etc., with an optional --name, i.e. runs/exp3_name.
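A minimal sketch of how such an incrementing scheme could work; the helper name `increment_dir` and its details are illustrative, not necessarily the implementation in 603ea0b:

```python
import glob
import re
from pathlib import Path

def increment_dir(base='runs/exp', name=''):
    # Return the next free runs/expN[_name] path (illustrative helper, not the repo's exact code)
    nums = []
    for d in glob.glob(f'{base}*'):
        m = re.search(r'exp(\d+)', Path(d).name)
        if m:
            nums.append(int(m.group(1)))
    n = max(nums) + 1 if nums else 0
    return f'{base}{n}' + (f'_{name}' if name else '')

print(increment_dir(name='tutorial'))  # e.g. 'runs/exp0_tutorial' on a fresh workspace
```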
@alexstoken @jundengdeng do you guys have any thoughts on integrating hyps dictionaries into data.yaml files? I've been considering how to extract the hyps from train.py, since I think they should absolutely be separated, but I'm not sure whether to place the hyps in their own yaml file, as in yolov5/hyps/hyps_coco.yaml etc., or to simply place them directly into the coco.yaml file. I suppose hyps may vary optimally between models, i.e. yolov5s and yolov5x are very different sizes, so it's possible each may have optimal and different hyps from the other, but the main differentiator is the dataset in my mind, so it would be reasonable to link a hyps dictionary to a dataset. What do you think?
Thanks for asking.

```python
_base_ = './yolov5x_coco.py'

# new learning policy
lr0 = 0.0001
```
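The snippet above resembles MMDetection-style config inheritance. A minimal sketch of how the same idea could be applied to the YAML hyp files discussed in this PR; the `_base_` key, file names, and helper are illustrative, not existing repo behavior:

```python
import yaml

def load_hyp(path):
    # Resolve an optional '_base_' key by loading the parent file first,
    # then letting the child's keys override it (illustrative sketch only).
    with open(path) as f:
        cfg = yaml.safe_load(f) or {}
    base_path = cfg.pop('_base_', None)
    if base_path:
        base = load_hyp(base_path)
        base.update(cfg)
        cfg = base
    return cfg

# e.g. a hyp_finetune.yaml containing "_base_: hyp_coco.yaml" and "lr0: 0.0001"
# hyp = load_hyp('hyp_finetune.yaml')
```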
@glenn-jocher Nice change. I wasn't sure of the best way to implement the incrementer, but that works well. Much easier to look at/use than the tensorboard default.

#345: I'm realizing this PR didn't take into account any of the tutorials that were based on the previous logging structure. It seems this only comes into play in the Visualize section, because the image locations are hard-coded. A similar solution would be to change the paths to:

```python
# Old
Image(filename='./train_batch1.jpg', width=900)  # view augmented training mosaics
# New
Image(filename='./runs/exp0_tutorial/train_batch1.jpg', width=900)  # view augmented training mosaics
```

It seems like the bug in #345 is similar, and Roboflow should update their tutorial to match the evolving repo (you did mention things were subject to change!)
I think they should remain separate from the data. Only at the end, after many experiments are run, can an optimal hyp set be determined for a dataset and model. So for the provided benchmarks you can provide some hyp files, but when a user goes to experiment with their own custom data, tying the hyps and the data together could be problematic, since then the data file isn't static and it'll be more confusing to monitor which changes in hyp improved performance.

I think the main modular elements of each training run are the data, the model architecture, the learning/optimization scheme, and the program arguments (similar to the best practices described in PyTorch Lightning - they have 3, plus data). Right now those are distributed across 4 files (data, model, hyp, opt), which I think pretty closely follows best practice. The only change would be to allow a user to pass
@alexstoken yes, the Visualize section of all the tutorials is broken now and needs fixing. You can't make an omelette without cracking a few eggs. OK, let's just sleep on this for a few days while I patch up things elsewhere. I think the largest updates have already been implemented and the results seem to be working well, so I feel we've already taken a big step in the right direction!
@alexstoken what do you think about single-command resume? For example, your VM shuts down, you restart it, and continue with just a single resume command, without having to worry about the exact settings you used to start the run initially. Do we have everything we need saved in opt.yaml etc. to do this currently? We would also need to make sure the process is repeatable, so that --resuming once doesn't hurt --resuming a second time.
The reason I'm thinking of this, BTW, is that 90% of the misunderstandings and errors people run into are typically human error. For example, I might try to --resume with a different batch size or model, or forget that I used --single-cls when starting the training, and then not understand why I get an error resuming later without it; or, just as bad, I get no error but discontinuous results and can't understand why. The less 'manual labor' that is required, the fewer opportunities people have to introduce errors.
@glenn-jocher great idea, I think that's much more user friendly and will save a ton of headaches. All of the components are in place, for the most part. For example:

```python
import argparse, os, glob, yaml
from argparse import Namespace


def get_latest_run(search_dir='./runs'):
    # Return the most recently modified run dir in /runs (its opt.yaml and weights/last.pt are used to --resume)
    last_list = glob.glob(f'{search_dir}/*', recursive=False)
    return max(last_list, key=os.path.getctime)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=300)
    parser.add_argument('--resume', nargs='?', const='get_last', default=False, help='resume training from last.pt')
    parser.add_argument('--weights', type=str, default='', help='initial weights path')
    opt = parser.parse_args()

    if opt.resume:
        last_run_dir = get_latest_run()
        with open(last_run_dir + os.sep + 'opt.yaml', 'r') as f:
            resume_opt = Namespace(**yaml.load(f, Loader=yaml.FullLoader))
        resume_opt.weights = last_run_dir + os.sep + 'weights' + os.sep + 'last.pt'
        print(f'Resuming training from {last_run_dir}')
        opt = resume_opt
    print(opt)
```

Results in (old runs from before the exp# naming update):
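One way to reason about glenn-jocher's repeatability concern: rebuilding opt from the saved opt.yaml should give identical settings no matter how many times it is done, with --weights always re-pointed at the run's newest last.pt. The check below is a hypothetical, self-contained sketch (the fake opt.yaml contents and helper name are illustrative, not the PR's actual code):

```python
import os
import tempfile
from argparse import Namespace

import yaml

# Write a fake opt.yaml into a throwaway run dir (contents and paths are illustrative)
run_dir = tempfile.mkdtemp(prefix='exp0_')
os.makedirs(os.path.join(run_dir, 'weights'), exist_ok=True)
with open(os.path.join(run_dir, 'opt.yaml'), 'w') as f:
    yaml.dump({'epochs': 300, 'weights': '', 'resume': False}, f, sort_keys=False)

def load_resume_opt(run_dir):
    # Rebuild opt from the saved yaml and point weights at the run's last.pt
    with open(os.path.join(run_dir, 'opt.yaml')) as f:
        opt = Namespace(**yaml.safe_load(f))
    opt.weights = os.path.join(run_dir, 'weights', 'last.pt')
    return opt

first = load_resume_opt(run_dir)
second = load_resume_opt(run_dir)   # a second --resume should see identical settings
assert vars(first) == vars(second)
print(first)
```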
Hello,
This PR adds new components to, and changes the structure of, logging to better support multiple experiments. It was inspired by some of the logging done in PyTorch Lightning. I am opening this PR as a starting place to discuss these changes, and am happy to modify it once best logging practices are agreed upon.
Additions

- Log `opt` to the same directory as the tensorboard log
- Log `hyp` to the same directory as the tensorboard log
- Add `--hyp` argument to pass a YAML file of hyperparameters. Update the default `hyp` dictionary with these parameters.

Changes

- `last.pt` and `best.pt` are saved to the same directory as the tensorboard log. Thus, each experimental run will have a `last.pt` and a `best.pt`. This takes up more space, so perhaps this should be optional, or some logic to delete old runs can be implemented.
- `get_latest_run()` finds the most recently changed `*/last.pt` amid all of the files in `/runs/*` to use as the default resume checkpoint, since weight files are no longer stored in the root dir (a possible implementation is sketched just after this list). Another option would be to store `last.pt` in the root dir until training is complete, then move it to `/runs/*/weights`.
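A minimal sketch of what such a search could look like; this is a hypothetical variant written against the description above, not necessarily the PR's exact helper:

```python
import glob
import os

def get_latest_run(search_dir='./runs'):
    # Return the most recently modified '*/last.pt' under runs/ (hypothetical variant)
    last_list = glob.glob(f'{search_dir}/**/last.pt', recursive=True)
    return max(last_list, key=os.path.getctime) if last_list else ''
```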
Proposed final dir structure for logging

runs/
- CURRENT_TIME[--name]/
  - hyp.yaml
  - opt.yaml
  - events.out.file
  - LR.png
  - weights/
    * last.pt
    * best.pt
- CURRENT_TIME[--name]/
  - hyp.yaml
  - opt.yaml
  - events.out.file
  - LR.png
  - weights/
    * last.pt
    * best.pt

Notes

- Each run's log directory is named `CURRENT_TIME[--name]`.
🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Enhancements in saving directories and hyperparameter configurations for YOLOv5 training and testing.

📊 Key Changes

- New `save_dir` parameter in `test.py` and `train.py` to specify save directories.
- `glob.glob` paths now use `Path` objects for improved OS compatibility.
- Hyperparameter (`hyp`) dict in `train.py` now includes the optimizer (`'SGD'`, or `'adam'` if specified).
- `get_latest_run` fetches the path to the latest run for resuming training.
- Run settings (`hyp.yaml` and `opt.yaml`) saved in the run directory.

🎯 Purpose & Impact

- Logged settings now capture the choice between `SGD` and `Adam` optimizers.
- The `get_latest_run` function enables users to seamlessly resume training from the most recent checkpoint without manually specifying it.
- Using `Path` objects for file paths ensures compatibility across different operating systems, reducing potential path-related issues.