Error when saving period and parameters #6452
Comments
@Juanjojr9 Hey, the command seems correct. Can you check the run that you want to resume and see if the artifact is present there? You can even share the run link here and I'll look into this.
@Juanjojr9 in your command, you have
Hi @AyushExel, thank you for your answer. I think I got the wrong example earlier. Anyway, I did a quick test and now it does save, as you can see in the image below. I cancelled the run to simulate a problem. At first I thought it was going to work, but it gave me a problem. Analysing it, I realised that it changes the parameters I had set before. My question is: when I run `!python train.py --resume wandb-artifact://{crashed_run_path}`, do I also have to set the above parameters? Is there another way to save the data locally without losing the information in case the Google Colab session expires? Thank you
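For readers following the thread, a minimal sketch of what that resume command looks like as a Colab cell, assuming a placeholder `entity/project/run_id` artifact path; the real path comes from the crashed run in your W&B workspace:

```python
# Resume a crashed YOLOv5 training run from its W&B artifact (Colab cell).
# "entity/project/run_id" is a placeholder; replace it with the run path of
# the crashed run shown in your Weights & Biases workspace.
!python train.py --resume wandb-artifact://entity/project/run_id
```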
@Juanjojr9 your syntax is correct. The problem that you're seeing in the pic above is that you're running out of CUDA memory. Read the last line of the error message.
@AyushExel I have not explained myself well. The reason it is running out of CUDA memory is that the parameters are being changed. My question is whether I should run the command by setting the parameters as well, i.e. as follows: If not, I don't understand why it changes the parameters I set when training the network when I try to resume the run.
@Juanjojr9 okay understood. This should not happen. I'll fix this coming week. Thanks for reporting
I don't know if it's my problem. I will try to keep doing different tests. Thanks to you for creating this incredible and majestic tool. Yours faithfully,
@Juanjojr9 I've pushed a PR with the fix for this
@AyushExel Thank you very much for the help and the quick solution. I look forward to using it when it is ready. Best regards. JJ
@Juanjojr9 good news 😃! Your original issue may now be fixed ✅ in PR #6452 by @AyushExel. To receive this update:
Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀! |
@Juanjojr9 Your problem should be fixed in the latest release; the resume command will now remember the batch size. Please verify. Thanks!
@AyushExel Hi! I have been doing different tests and I don't think it's working properly yet, although I could be wrong. These are my parameters: I interrupted the training and then resumed it. At first I thought it was working well, as it starts at the right epoch. But there are some things that don't seem logical to me:
Therefore, the parameters still do not match. I don't know if I'm wrong.
@Juanjojr9 You might be right about the image size, as we're not restoring that. But can you please check the batch size from your wandb run config? It should be in the overview tab of your wandb run. The batch size should match before and after resume.
@AyushExel I'm trying to check the batch size after restarting, but I don't know where to see it.
@Juanjojr9 You're on the right screen. Just scroll down a bit further.
@AyushExel But how do I know if that is the setting from before cancelling or after resuming? You can see the following: among other things, the --imgsz parameter is set to 1280 and --resume is false. Therefore, I believe these configuration parameters are from before interrupting the run, not the parameters after resuming. What is your opinion?
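For anyone who wants to check this outside the web UI, here is a small sketch using the wandb public API; the run path is a placeholder, and the config key names are assumptions based on YOLOv5's option names, so they may differ in your run:

```python
import wandb

# Fetch the stored config of a run via the W&B public API.
# "entity/project/run_id" is a placeholder for your own run path.
api = wandb.Api()
run = api.run("entity/project/run_id")

# Print the options discussed in this thread; the key names are assumptions
# based on YOLOv5's opt namespace and may be named differently in your run.
for key in ("batch_size", "imgsz", "epochs", "resume"):
    print(key, "=", run.config.get(key))
```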
I think the batch size is being restored but not the img size. You can still pass those params with the resume flag.
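Following that suggestion, a sketch of restating the parameters explicitly alongside the resume flag; the artifact path and the values are placeholders, and whether train.py honours such overrides on resume depends on the YOLOv5 version in use:

```python
# Resume from the W&B artifact while passing the original settings explicitly.
# The artifact path and the values shown are placeholders; adjust them to your run.
!python train.py --resume wandb-artifact://entity/project/run_id --imgsz 1280 --batch-size 16
```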
@AyushExel If that were the case, then in the screenshots of the parameter settings, besides the batch size being set to 1, the imgsz parameter should be 640 and not 1280, since the image size is not supposed to be saved. It would be the same with the other parameters. From my humble point of view, I think it doesn't save the parameters properly. As you can see in the screenshots above, when it resumes it creates a new opt.yaml file, and the parameters set there have nothing to do with the ones set before. Even if the batch size is saved correctly, it would not be of much use, as the other parameters have changed and the original run has not really been resumed, so the results would be false and erroneous.
@Juanjojr9 I think this requires a deeper look. If you look at the PR linked above, you'll see that the hyp dict, batch size and epochs are restored from the run. So if that's not showing up in the experiment, it's probably being overwritten somewhere. I'll take a deeper look again, because the cause of the problem seems to be located somewhere else.
@AyushExel Ok, thank you very much for your help. I will be waiting for your answer, as it is a very interesting tool. I also wanted to tell you that the wandb.login() command has been giving me problems in Google Colab for several days. I don't know if it's my fault. I will keep looking for information. Thank you again.
@Juanjojr9 thanks for reporting. I'll work on verifying the resume issue again this week.
@AyushExel When I run the command, it seems unable to start the session; it hangs for a long time, and this causes the Google Colab session to crash and restart.
@Juanjojr9 Okay, I investigated the resume further. The batch size was being remembered but the image size wasn't. I've made a PR to remember that. Currently, here are the params that are remembered when resuming. If there are any more things that need to be remembered, please let me know.
@Juanjojr9 good news 😃! Your original issue may now be fixed ✅ in PR #6611 by @AyushExel. To receive this update:
Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀! |
@AyushExel Thank you very much for the help. I'll try it out and see how it works. Best regards. JJ
Search before asking
Question
Hi, I have searched but have not found a similar question.
I was training a long model, and just in case there was a problem, I set the --save-period parameter, as you can see in the following picture:
I had a problem and wanted to resume the interrupted run, since my Google Colab session expired and I lost all my session data, but it gave me another problem.
When I compared it with this tutorial I found: https://colab.research.google.com/github/wandb/examples/blob/master/colabs/yolo/Train_and_Debug_YOLOv5_Models_with_Weights_%26_Biases.ipynb#scrollTo=jwcBfF5OvAHk
In that training it saves the data every epoch, while mine should save every 5 epochs, but it doesn't save anything:
It's a real bummer, as the model had been training for several hours.
Does anyone know how to fix this, or how I can save the model and the parameters in case I get an error again? Any help is welcome.
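For context, a minimal sketch of the kind of Colab training command being described here, with --save-period so checkpoints are saved and logged every few epochs; the dataset yaml, starting weights, image size and epoch count are placeholders:

```python
# Train YOLOv5 and save/log a checkpoint every 5 epochs (Colab cell).
# The dataset yaml, starting weights, image size and epoch count are placeholders.
!python train.py --data data/custom.yaml --weights yolov5s.pt --imgsz 1280 --epochs 100 --save-period 5
```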
I also wanted to ask if it is possible to download the data locally into a folder instead of using the Weights and Biases website.
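On the local-download question, a sketch of pulling a logged artifact into a local folder with the wandb public API, so the weights survive a Colab session expiring; the artifact path and version tag are placeholders and depend on what the run actually logged:

```python
import wandb

# Download a logged model artifact to a local folder instead of relying on the
# W&B website. "entity/project/artifact_name:latest" is a placeholder; copy the
# exact path from the Artifacts tab of your run.
api = wandb.Api()
artifact = api.artifact("entity/project/artifact_name:latest")
local_dir = artifact.download(root="checkpoints")  # files are written here
print("Downloaded to", local_dir)
```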
Thank you very much for your help.
JJ
Additional
No response