Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update expt name comment and folder parsing for training #978

Merged
merged 9 commits into from
Oct 13, 2020

Conversation

Borda
Copy link
Contributor

@Borda Borda commented Sep 16, 2020

updating argument description and cleaning weak parsing indexes from listed folders

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Enhanced logging details, clarified experiment naming, and refined hyperparameter evolution persistence.

📊 Key Changes

  • Made logger messages more readable by splitting details across multiple lines.
  • Changed the --name argument to rename the experiment folder rather than just the results file.
  • Improved logger information when starting TensorBoard with a formatted string.
  • Evolved hyperparameters are now saved in a path that respects the user-defined log directory.
  • Updated directory incrementing function to use regex for more robust directory numbering.

🎯 Purpose & Impact

  • 💡 Easier to read logs: Logging messages are now more user-friendly.
  • 📁 Clearer naming: Helps users easily identify and organize their experiments based on custom names.
  • 📈 Improved TensorBoard integration: Users get clearer instructions on how to view TensorBoard outputs.
  • 🛠 Flexible hyperparameter evolution storage: Storing evolved hyperparameters in user-defined locations allows for better organization and usage in future training.
  • 🎲 Enhanced directory management: Uses regular expressions for improved reliability in incrementing experiment folders, reducing potential confusion or errors when running multiple experiments.

These changes aim to provide a smoother user experience, with intuitive navigation and better organization of training runs, which is key for users managing multiple experiments and models.

@Borda
Copy link
Contributor Author

Borda commented Sep 16, 2020

just thinking that it would be easier if when comment is passed an there is no such folder yet, the experiment would be named just with the comment...

@glenn-jocher
Copy link
Member

@Borda thanks for the PR! I always like to try to understand use cases better. We made the runs/exp change because the default autogenerated tensorboard directories were quite verbose, including timestamps and device identifiers.

The strategy I've been using is to supply a --log-dir and --name when training, which combine into a unique incremented directory. For example:
python train.py --log runs/voc --name yolov5s_finetune_640

Would log to runs/voc/exp0_yolov5s_finetune_640. I use this on Colab with Google Drive mounted or organize runs:
Screen Shot 2020-09-16 at 1 16 55 PM

Would the change only apply a unique increment if an existing directory with the same name was encountered (or if no --name was supplied)?

@Borda
Copy link
Contributor Author

Borda commented Sep 16, 2020

well, in this PR I didn't do completely what I mentioned in the PR just try to fix some edge case with dir parsing and update argument description...
Yes, optionally skip the exp{N} mind be useful for some folder grouping as you have in your screen-shot, I would rather see all sorted/grouped by model then chronologically by exp hash...

@NanoCode012
Copy link
Contributor

Hi @Borda , I actually find the current naming quite nice.

I would rather see all sorted/grouped by model then chronologically by exp hash...

Incidentally, if you use Tensorboard, you can "Filter runs" and see the visualization for a certain group of experiments. I also find the exp number first being more useful as I can tell chronologically which experiment I've performed first across many without checking time stamp.

It could also just be me building my workflow around the current repo. Feel free to give me your opinion.

@Borda
Copy link
Contributor Author

Borda commented Sep 17, 2020

@NanoCode012 I see, I would do it just optional and anyway it was just a proposal and it is not part of this PR
so let's move the discussion about naming to the Issues as this is just fixing comment and parsing the existing folders...

@Borda Borda changed the title add expt name for training update expt name comment and folder parsing for training Sep 17, 2020
@glenn-jocher
Copy link
Member

@Borda ah, I see. So the scope of the changes is to increase robustness to edge cases without modifying the functionality, I misunderstood before.

I see the argparser explanation fix also, thanks!

@glenn-jocher
Copy link
Member

@Borda was there a specific edge case you had in mind that would cause the current code to fail?

@Borda
Copy link
Contributor Author

Borda commented Sep 18, 2020

@Borda was there a specific edge case you had in mind that would cause the current code to fail?

Yes, the case was that I had other custom folder which doesn't have the exp{n} on the beginning

@Borda
Copy link
Contributor Author

Borda commented Sep 24, 2020

@glenn-jocher gan ve merge this one? :]

@Borda
Copy link
Contributor Author

Borda commented Sep 29, 2020

@glenn-jocher ^^

@Borda
Copy link
Contributor Author

Borda commented Oct 12, 2020

@glenn-jocher ^^

train.py Outdated Show resolved Hide resolved
train.py Outdated
if opt.bucket:
os.system('gsutil cp gs://%s/evolve.txt .' % opt.bucket) # download evolve.txt if exists

for _ in range(300): # generations to evolve
for _ in tqdm(range(300), desc='perform evolve >>'): # generations to evolve
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't use tqdm here as there are many print/logging statements within each training that it does not handle correctly. I tested evolution on the branch right now to verify the issue. Will remove.

Copy link
Contributor Author

@Borda Borda Oct 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, we can and I would recommend writing somewhere which index is actually running moreover the tqdm gives you an estimate of how much longer you need to finish the evolv...
otherwise, you need to very hard way compute single training from an estimate of each epoch and extrapolate to all queued experiment, the worse case it opening the terminal after some time somehow count how many training finished because there no such count... :]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understand, unfortunately tqdm is badly behaved when printing within the a tqdm loop. The best way to monitor evolution progress is to look at the yolov5/evolve.txt file, which will show 1 row per generation (sorted by fitness, top row best). hyp.evolve.yaml will also show the best generation index, I should probably update this line to show the best gen / total gen:

# Hyperparameter Evolution Results
# Generations: 306
# P R mAP.5 mAP.5:.95 box obj cls
# Metrics: 0.6 0.936 0.896 0.684 0.0115 0.00805 0.00146

This method also helps you keep track of distributed evolution progress using the example in #607, where multiple single-GPU processes can evolve to the same central evolve.txt and hyp file.

anchor evolution is working correctly now
prefer the single line readout for concise logging, which helps simplify notebook and tutorials etc.
train.py Show resolved Hide resolved
Copy link
Member

@glenn-jocher glenn-jocher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Borda was there a specific edge case you had in mind that would cause the current code to fail?

Yes, the case was that I had other custom folder which doesn't have the exp{n} on the beginning

Ok I think I found this note where you described the problem, but I don't quite understand. Currently we have

  • {log_dir}/exp{N}_{name}/
  • {log_dir}/exp{N}/

Can you provide steps to reproduce the problem so I can understand better? Thanks!

EDIT: All other remaining changes look good.

utils/general.py Show resolved Hide resolved
@Borda
Copy link
Contributor Author

Borda commented Oct 12, 2020

Can you provide steps to reproduce the problem so I can understand better? Thanks!

the case as I remember was that if you have some other subfolders in the dir which do not match the experiment pattern like this-is-my_extra-folder so it triggers parsing because of _ but there is not a number so int('some-string') fails

@glenn-jocher
Copy link
Member

Ok looks good! Will merge.

@glenn-jocher glenn-jocher merged commit 00917a6 into ultralytics:master Oct 13, 2020
@Borda Borda deleted the expt-name branch October 13, 2020 13:27
glenn-jocher added a commit that referenced this pull request Oct 15, 2020
* comment

* fix parsing

* fix evolve

* folder

* tqdm

* Update train.py

* Update train.py

* reinstate anchors into meta dict

anchor evolution is working correctly now

* reinstate logger

prefer the single line readout for concise logging, which helps simplify notebook and tutorials etc.

Co-authored-by: Glenn Jocher <[email protected]>
burglarhobbit pushed a commit to burglarhobbit/yolov5 that referenced this pull request Jan 1, 2021
…#978)

* comment

* fix parsing

* fix evolve

* folder

* tqdm

* Update train.py

* Update train.py

* reinstate anchors into meta dict

anchor evolution is working correctly now

* reinstate logger

prefer the single line readout for concise logging, which helps simplify notebook and tutorials etc.

Co-authored-by: Glenn Jocher <[email protected]>
KMint1819 pushed a commit to KMint1819/yolov5 that referenced this pull request May 12, 2021
…#978)

* comment

* fix parsing

* fix evolve

* folder

* tqdm

* Update train.py

* Update train.py

* reinstate anchors into meta dict

anchor evolution is working correctly now

* reinstate logger

prefer the single line readout for concise logging, which helps simplify notebook and tutorials etc.

Co-authored-by: Glenn Jocher <[email protected]>
BjarneKuehl pushed a commit to fhkiel-mlaip/yolov5 that referenced this pull request Aug 26, 2022
…#978)

* comment

* fix parsing

* fix evolve

* folder

* tqdm

* Update train.py

* Update train.py

* reinstate anchors into meta dict

anchor evolution is working correctly now

* reinstate logger

prefer the single line readout for concise logging, which helps simplify notebook and tutorials etc.

Co-authored-by: Glenn Jocher <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants