Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log command line options, hyperparameters, and weights per run in runs/ #104

Merged
merged 40 commits into from
Jul 9, 2020
Merged

Log command line options, hyperparameters, and weights per run in runs/ #104

merged 40 commits into from
Jul 9, 2020

Conversation

alexstoken
Copy link
Contributor

@alexstoken alexstoken commented Jun 16, 2020

Hello,

This PR adds additional components to and changes the structure of logging to better support multiple experiments. This was inspired by some of the logging done in PyTorch Lightning. I am opening this PR as a starting place to discuss these changes, and am happy to modify the PR once best logging practices are agreed upon.

Additions

  • writes YAML of command line options opt to same directory as tensorboard log
  • writes YAML of hyperparameters hyp to same directory as tensorboard log
  • add argument --hypto pass YAML file of hyperparameters. Update default hyp dictionary with these parameters.

Changes

  • last.pt and best.pt are saved to the same directory as tensorboard log. Thus, each experimental run will have a last.pt and a best.pt. This takes up more space, so perhaps this should be optional or some logic to delete old runs can be implmented.
  • get_latest_run() finds most recently changed */last.pt amid all of the files in /runs/* to use as default resume, since weight files are no longer stored in the root dir. Another option would be to store last.pt in the root dir until training is complete, then move it to /runs/*/weights
  • plot LR scheduler by default, and save this to tensorboard log dir as well

Proposed final dir structure for logging

  • yolov5
    • data
    • inference
    • models
    • utils
    • runs
      • 01_01_2020
        - hpy.yaml
        - opt.yaml
        - events.out.file
        - LR.png
        - weights/
        * last.pt
        * best.pt
      • 01_02_2020
        - hpy.yaml
        - opt.yaml
        - events.out.file
        - LR.png
        - weights/
        * last.pt
        * best.pt

Notes

  • tensorboard log directory is nominally named CURRENT_TIME[--name].
  • I haven't had the chance to test the changes yet, but wanted to begin a discussion of logging and get some feedback as I work through this. If these types of changes aren't in the scope of PRs at this time, hopefully this can at least get ultralytics thinking about potential updates! I hope to test and do bug fixing on these changes later this week.

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Enhancements in saving directories and hyperparameter configurations for YOLOv5 training and testing.

📊 Key Changes

  • Added save_dir parameter in test.py and train.py to specify save directories.
  • glob.glob paths now use Path objects for improved OS compatibility.
  • Hyperparameters (hyp) dict in train.py now includes the optimizer ('SGD' or 'adam' if specified).
  • Directories for logging, last checkpoint, and results are now dynamically generated based on TensorBoard log directory.
  • New function get_latest_run fetches the path to the latest run for resuming training.
  • Training hyperparameters and options (hyp.yaml and opt.yaml) saved in the run directory.
  • Labels and learning rate scheduler plots are now saved in corresponding directories.

🎯 Purpose & Impact

  • 📁 Enhanced File Management: Custom save directories allow users to better organize their training and testing outputs, which can be especially useful when running multiple experiments.
  • 🤖 Improved Optimizer Flexibility: With the inclusion of the 'optimizer' hyperparameter, users gain more control over their training by explicitly choosing between SGD and Adam optimizers.
  • 🔄 Smoother Resumption of Training: The get_latest_run function enables users to seamlessly resume training from the most recent checkpoint without manually specifying it.
  • 🌐 Cross-OS Compatibility: Using Path objects for file paths ensures compatibility across different operating systems, reducing potential path-related issues.
  • 🎨 Visualization of Training Dynamics: By saving additional plots related to label distribution and learning rate schedules, users can more easily visualize and understand the dynamics of their training process.

@alexstoken
Copy link
Contributor Author

Update: I've made some bug fixes, and can verify that the saving with the above dir structure, loading new hyp, and automatically resuming from most recent last file work. Still haven't written tests for these things, so I can't say it's been rigorously tested.

But it works with python 3.7.6, pytorch 1.6

@glenn-jocher
Copy link
Member

@alexstoken thanks for the PR! This area is definitely in need of some organization, and we've had issues specifically related to this as well #20.

Moving the hyps to their own yaml file was also on my TODO list, so it looks like you've beat me to it. This should make it much easier to organize trainings for different datasets (the current hyps are mainly tuned for coco).

I'll try to look over this this week.

@glenn-jocher
Copy link
Member

glenn-jocher commented Jun 17, 2020

@alexstoken it is also true that the total training settings (i.e. what you would need to completely reproduce a training run) are currently sourced from an awkward combination of the model.yaml, dataset.yaml, hyps and argparser arguments, which makes things a bit confusing of course.

One other quick point is that there are several png's generated during training. They are:

  • LR.png: planned LR schedule
  • labels.png: which shows label statistics (class frequency histogram, width-height and xy 2d histograms in normalized image space (0-1).
  • train_batch0.jpg: label overlays for first train batch (up to 16 images)
  • train_batch1.jpg: label overlays for second train batch (up to 16 images)
  • train_batch1.jpg: label overlays for third train batch (up to 16 images)
  • test_batch0_gt.jpg: ground truth labels for first test batch (up to 16 images)
  • test_batch0_pred.jpg: predictions for first test batch (up to 16 images)

The default behavior is to generate the 5 once and done, however if the user deletes any during training, they will be regenerated at the next opportune moment, i.e. so you can see the latest test results. Generating the images is slow (i.e. 1s per image), which is why I did not default to creating them every epoch.

The most important point here is that all of these images were created for better introspection capability, but are now a bit disorganized, and could also perhaps use a better system, i.e. a dashboard or better integration with tensorboard for better visibility/usability/discoverability.

@alexstoken
Copy link
Contributor Author

alexstoken commented Jun 17, 2020

@glenn-jocher Ah, I didn't realize that issue was related because of the title but they definitely have overlap here. Agreed on the ease of multiple hyp files for different datasets. I think it's ok to leave the included hyps in train.py as a default, but they could also be moved and have the argparser default to a 'default_hyp.yaml or 'coco_hyp.yaml file, as you suggest.

I noticed and adjusted some of those plots/images as well. I'll add some commits changing the save location of the rest. I'm tentatively using the argument save_dir ='.' for the plotting functions in utils.py. In train.py, save_dir = log_dir, when applicable.

I like the idea of moving those to tensorboard or another dashboard. You've already done a great job making all of the other losses/metrics available there, so adding these there as well makes sense.

Also, the --resume-from-run argument is designed to address the use case in #20 . You can specify a particular task to resume from, in case more than one was interrupted. I'm not sure this is as clear as it could be, so I'm interested in any suggestions you may have on that.

@glenn-jocher
Copy link
Member

glenn-jocher commented Jun 23, 2020

@alexstoken was reviewing the PR. Two things here:

  1. Can you revert the coco128.yaml changes? They are pointing to a local directory of yours.
  2. Can you also remove the --resume code? I know a lot of people seem to like this, but in reality it has never worked in either this repo or yolov3, and creates many more problems than it's worth, i.e. (--resume bug #141), especially with the EMA and LR schedules in place.

EDIT: to clarify, I think I'm going to remove all existng --resume functionality as well, not just the additions you proposed.

@alexstoken
Copy link
Contributor Author

Hi @glenn-jocher ,

  1. Ah, sorry about that, the coco128.yaml changes must've snuck into the commit. Reverted.
  2. All --resume code removed. I agree with and understand your rationale. I also had similar issues with resuming runs. If those bugs are fixed (I had assumed they were due to burn in or some sort of warm up), this PR can always be revisited. To add a visual to --resume bug #141 , this was my experience with multiple `--resume`` sequences:
    image
    image

Additional changes to consider (included in PR, but can be removed):

  1. It sees like hyp contains all of the optimizer, loss, and data augmentation settings except for the optimizer choice itself. So I decided to remove --adam from the argparser, and instead add an optimizer key to hyp. This defaults to SGD (and None will default to SGD).
  2. Adam has adjustable momentum too, in the form of beta1. In source code of official OneCycleLR implementation, PyTorch changes beta1 when present, in lieu of momentum. I've put a rudimentary form of this in place, using the default beta2 hardcoded in.

Potential Changes for this PR or a follow up
As you suggested above, the current mix of yaml and argparse is a bit confusing. However, I think it does make sense to have many of these settings in different yaml files, and it makes it easy to combine them and track different experiments. A potential next step for organization is to have the argparser take just 4 required args --cfg, --data, --hyp, --opt, with optional --name. All of the current arguments could remain as optional overrides to the opt.yaml. This would also let users take advantage of the 'opt.yaml' saves that would begin happening with this PR.

@glenn-jocher glenn-jocher merged commit 0fef3f6 into ultralytics:master Jul 9, 2020
@glenn-jocher
Copy link
Member

@alexstoken I've made a few updates and merged just now. Hopefully I didn't break anything with my changes. I'll do some checks now.

@glenn-jocher
Copy link
Member

@alexstoken hey I just thought of something. We have tensorboard tightly integrated here, but do you think there are use cases where people would prefer to train without it? Is it a requirement for pytorch lightning training?

@glenn-jocher
Copy link
Member

It seems to be working!! I haven't tried to --resume yet, but starting different training runs results in different folders with all the same details. Very nice, good work!!

Screen Shot 2020-07-08 at 5 12 15 PM

@jundengdeng
Copy link

Thanks for the contribution. There may be a mistake to log the parameters in "hyp" as the parameters in "hyp" are modified some times. For example, currently we log "hyp['cls'] *=nc/80" rather than the original hyp['cls']. I would suggest that we move the following lines

# Save run settings
    with open(Path(log_dir) / 'hyp.yaml', 'w') as f:
        yaml.dump(hyp, f, sort_keys=False)
    with open(Path(log_dir) / 'opt.yaml', 'w') as f:
        yaml.dump(vars(opt), f, sort_keys=False)

to line 54 in "train.py"

@glenn-jocher
Copy link
Member

@jundengdeng yes you are correct! Would you like to submit a PR?

@alexstoken alexstoken deleted the advanced_logging branch July 9, 2020 19:16
@alexstoken alexstoken restored the advanced_logging branch July 9, 2020 19:30
@alexstoken alexstoken deleted the advanced_logging branch July 9, 2020 19:50
@alexstoken
Copy link
Contributor Author

@glenn-jocher Glad to see it's all working well. Sorry to not have picked up on your formatting when I went through it the first time around. Probably would have been an easier merge if I had.

@jundengdeng good catch! I had it down there specifically to log the updated hyp['cls'], but you're right, if it logs that and then you rerun with --resume, becuase of the *= you'll get a different result. It seems a similar scaling happens to weight decay. @glenn-jocher , this raises a question: do you want to log the final hyperparameters used with the model (as is currently done), or do we want to log the hyperparameters that would properly be sent for a --resume run to be consistent?

I had done the former on purpose for logging because it shows the user what the model actually ran with, but it does create a bug with the --resume functionality. But now that I look closer, it seems that because the adjustments are just scaling, so the true "effective" hyp values that should be preserved are the ones originally passed in. Is this correct?

@glenn-jocher
Copy link
Member

@alexstoken @jundengdeng we want to log the settings to reproduce a run, so we should ignore the operations that those settings undergo once inside a run, as it is the responsibility of train.py to resume properly given the same initial settings and an epoch number.

This particular operation scales class loss by count, otherwise the loss mean operation will have negative effects for smaller class count datasets as the cls hyp was set for COCO at 80 classes.

No worries about the formatting, I should probably probably make a contribution guide or similar. One of the simplest things you can do is to pass your code through a PEP8 formatter prior to submission. PyCharm makes this very easy for example with Code > Reformat Code.

@alexstoken
Copy link
Contributor Author

@glenn-jocher Makes sense. Do you want a PR to fix the bug or just commit the change yourself? I can do the PR real quick if that's your preference. That way users won't have a bug in resume for long.

@glenn-jocher
Copy link
Member

@alexstoken sure can you submit a PR?

@glenn-jocher
Copy link
Member

I've updated the log_dir naming defaults now to make it a bit cleaner in 603ea0b

New runs will be saved to runs/exp1, runs/exp2 etc with optional --name, i.e. runs/exp3_name

@glenn-jocher
Copy link
Member

glenn-jocher commented Jul 10, 2020

@alexstoken @jundengdeng do you guys have any thoughts on integrating hyps dictionaries into data.yaml files?

I've been considering how to extract the hyps from train.py, since I think they should absolutely be seperated, but I'm not sure whether to place the hyps in their own yaml file, as in yolov5/hyps/hyps_coco.yaml etc, or to simply place them directly into the coco.yaml file.

I suppose hyps may vary optimally between models, i.e. yolov5s and yolov5x are very different sizes, its possible each may have optimal and different hyps from each other, but the main differentiator is the dataset in my mind, so it would be reasonable to link a hyps dictionary to a dataset. What do you think?

@jundengdeng
Copy link

@alexstoken @jundengdeng do you guys have any thoughts on integrating hyps dictionaries into data.yaml files?

I've been considering how to extract the hyps from train.py, since I think they should absolutely be seperated, but I'm not sure whether to place the hyps in their own yaml file, as in yolov5/hyps/hyps_coco.yaml etc, or to simply place them directly into the coco.yaml file.

I suppose hyps may vary optimally between models, i.e. yolov5s and yolov5x are very different sizes, its possible each may have optimal and different hyps from each other, but the main differentiator is the dataset in my mind, so it would be reasonable to link a hyps dictionary to a dataset. What do you think?

Thanks for asking.
What I would prefer is the following:

 
_base_ = './yolov5x_coco.py'

# new learning policy
lr0 = 0.0001

@alexstoken
Copy link
Contributor Author

I've updated the log_dir naming defaults now to make it a bit cleaner in 603ea0b

@glenn-jocher Nice change. I wasn't sure the best way to implement the incrementer but that works well. Much easier to look at/use than the tensorboard default.

#345: I'm realizing this PR didn't take into account any of the tutorials that were based on the other logging structure. Seems this only comes into play in the Visualize section, because the image locations are hard-coded. A similar solution would be to change the paths to runs/exp0/[file]., and then all the visuals should work again. Like so:

>>>>> Old
Image(filename='./train_batch1.jpg', width=900)  # view augmented training mosaics
=======
Image(filename='./runs/exp0_tutorial/train_batch1.jpg', width=900)  # view augmented training mosaics
<<<<< New

It seems like the bug in #345 is similar, and Roboflow should update their tutorial to match the evolving repo (you did mention things were subject to change!)

do you guys have any thoughts on integrating hyps dictionaries into data.yaml files?

I think they should remain separate from the data. Only at the end, after many experiments are run, can an optimal hyp set be determined for a dataset and model. So for the provided benchmarks, you can provide some hyp files, but when a user goes to experiment with their own custom data, tying the hyps and the data together could be problematic since then the data file isn't static and it'll be more confusing to monitor changes in hyp that improved performance.

I think the main modular elements of each training run are the data, the model architecture, learning/optimization scheme, and program arguments (similar to best practices described in PyTorch Lightning - they have 3, plus data). Right now those are distributed across 4 files (data, model, hpy, opt), which I think pretty closely follows best practice. Only change would be to allow a user to pass --opt instead of filling out the args manually. I thought about this for this PR, and maybe allowing the --opt file passed to be overwritten by other args, but to become the default if not overwritten.

@glenn-jocher
Copy link
Member

@alexstoken yes, the Visualize section of all the tutorials are broken now, needs fixing. You can't make an omelette without cracking a few eggs.

Ok, lets just sleep on this for a few days while I patch up things elsewhere. I think the largest updates have already been implemented and the results seem to be working well, so I feel we've already taken a big step in the right direction!

@glenn-jocher
Copy link
Member

glenn-jocher commented Jul 13, 2020

@alexstoken what do you think about single command resume? For example your VM shuts down, you restart it and continue with just:

python train.py --resume

without having to worry about your exact settings you used to start the run initially. Do we have everything we need saved in opt.yaml etc to do this currently? We would also need to make sure the process is repeatable, that --resuming once doesn't hurt --resuming a second time.

@glenn-jocher
Copy link
Member

The reason I'm thinking of this BTW is that 90% of the misunderstandings and errors people run into are typically human error. For example I might try to --resume with a different batch size or model, or forgot that I used --single-cls when starting the training, and then not understand why I get an error resuming later without it, or just as bad I get no error but discontinuous results and can't understand why. The less 'manual labor' that is required the less opportunities people have to introduce errors.

@alexstoken
Copy link
Contributor Author

@glenn-jocher great idea, I think that's much more user friendly and will save a ton of headaches.

All of the components are in place, for the most part. get_latest_run() can be modified to search for the opt.yaml file, and that file contains everything needed to resume.

Better yet, the get_latest_run() can just look for the most recent folder, and then take the opt.yaml and last.pt from that folder. Here's a sample, based on our last argparser iterations, including modifications to get_latest_run():

import argparse, os, glob, yaml
from argparse import Namespace

def get_latest_run(search_dir='./runs'):
    # Return path to most recent 'last.pt' in /runs (i.e. to --resume from)
    last_list = glob.glob(f'{search_dir}/*', recursive=False)
    return max(last_list, key=os.path.getctime)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=300)
    parser.add_argument('--resume', nargs='?', const = 'get_last', default=False, help='resume training from last.pt')
    parser.add_argument('--weights', type=str, default='', help='initial weights path')
    
    opt = parser.parse_args()
    
    if opt.resume:
        last_run_dir = get_latest_run()

        with open(last_run_dir + os.sep + 'opt.yaml', 'r') as f:
            resume_opt = Namespace(**yaml.load(f, Loader=yaml.FullLoader))
        
        resume_opt.weights = last_run_dir + os.sep + 'weights' + os.sep + 'last.pt' 

        print(f'Resuming training from {last_run_dir}')
        opt = resume_opt

    print(opt)

Results in (old runs from before the exp# naming update):

$ python .\argparse_test.py --resume
Resuming training from ./runs\Jul12_17-55-37_JSSDW18050174equip
Namespace(batch_size=50, bucket='', cache_images=True, cfg='.\\models\\yolov5x.yaml', data='.\\data\\external_equ
ipment.yaml', device='', epochs=300, evolve=False, hyp='.\\equip_hyp.yaml', img_size=[384, 384], multi_scale=Fals
e, name='equip', noautoanchor=False, nosave=False, notest=False, rect=True, resume=False, single_cls=False, weigh
ts='./runs\\Jul12_17-55-37_JSSDW18050174equip\\weights\\last.pt')

$ python .\argparse_test.py
Namespace(epochs=300, resume=False, weights='')

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants