use the resume function in training but not save new mAP values #5860

Closed
alaa-shubbak opened this issue Aug 11, 2021 · 12 comments
Labels: bug, community help wanted

Comments

@alaa-shubbak

Dear all,
I am trying to resume training because I want more epochs than I originally ran.
I first trained Faster R-CNN for only 60 epochs, plotted the performance, and found the mAP values too low, so I went back to the model and ran this command to resume from the latest epoch:

python tools/train.py configs/human/my_custom_config.py --gpus 1 --work-dir training_data/faster_rcnn_epoch60 --resume-from training_data/faster_rcnn_epoch60/latest.pth
My problem is that when I resume, more epoch checkpoints are saved on my machine, but the validation mAP for those epochs is not saved. I need those values to evaluate the training process and choose the epoch with the highest mAP, so that I can use its weights for testing.

When I try to plot the performance from the new (.json) log file with this command:

python tools/analysis_tools/analyze_logs.py plot_curve training_data/faster_rcnn_epoch60/20210811_012551.log.json --keys bbox_mAP --legend mAP_bbox

I get this error message:

KeyError: 'training_data/faster_rcnn_epoch60/20210811_012551.log.json does not contain metric bbox_mAP'

This is strange, because I can plot the performance from the json file of the first training run, but not from the one written after resuming,
even though I did not change anything in the config file except the max_epoch value, which is now 120 instead of 60.

Any help, please?

@bommap2810

In my opinion, if you want to resume the training process, just edit the config file: change the line resume_from = None to resume_from = 'path_to_your_work_dirs/configs/last_epoch.pth'. I have resumed this way before and it worked.

@alaa-shubbak
Author

alaa-shubbak commented Aug 11, 2021

> In my opinion, if you want to resume the training process, just edit the config file: change the line resume_from = None to resume_from = 'path_to_your_work_dirs/configs/last_epoch.pth'. I have resumed this way before and it worked.

Okay, but I have one question:
if you do not change anything else in the config file, how do you tell the system at which epoch training should stop?

@alaa-shubbak
Author

@bommap2810

Could you please write out the full command you run for that on Linux?
For example, a command such as

python tools/analysis_tools/analyze_logs.py plot_curve training_data/faster_rcnn_epoch60/20210811_012551.log.json --keys bbox_mAP --legend mAP_bbox

Or do you write resume_from in the config file?

@bommap2810

> @bommap2810
>
> Could you please write out the full command you run for that on Linux?
> For example, a command such as
>
> python tools/analysis_tools/analyze_logs.py plot_curve training_data/faster_rcnn_epoch60/20210811_012551.log.json --keys bbox_mAP --legend mAP_bbox
>
> Or do you write resume_from in the config file?

I write resume_from in the config file and run the training command again:
python tools/train.py path_to_your_config/config.py

@alaa-shubbak
Author

Many thanks, I will try it.
Did you change the number of epochs or not?

@bommap2810

Yes, I did change the number of epochs.
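For readers following along, the two config edits discussed in this exchange might look roughly like the fragment below. This is only a sketch that assumes an MMDetection 2.x-style Python config; the `runner` and `evaluation` key names are assumptions that are not stated in this thread and may differ between versions, and the checkpoint path is simply the one from the original command.

```python
# Hypothetical excerpt of a custom config (MMDetection 2.x-style layout assumed).

# Checkpoint to continue from; the path is taken from the command in the first post.
resume_from = 'training_data/faster_rcnn_epoch60/latest.pth'

# Total number of epochs for the whole run, not the number of extra epochs.
# Resuming from an epoch-60 checkpoint with max_epochs=120 trains epochs 61 to 120.
runner = dict(type='EpochBasedRunner', max_epochs=120)

# Validate after every epoch so that each epoch can log a bbox_mAP value.
evaluation = dict(interval=1, metric='bbox')
```

With these values in the config, the training command no longer needs the `--resume-from` flag.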

@alaa-shubbak
Author

Great, now I get it. Many thanks, I will try it and see.
Thank you so much for the reply and the help.

@alaa-shubbak
Author

@bommap2810
Hello sir,
unfortunately I still have the same problem.
Training continues from the latest epoch and saves the bbox_mAP values (the evaluation metric) in a new log.json file,
but when I try to plot the curve of those mAP values versus epoch, it gives me the following error:

[Image: centernet error after plot curve function]

This is the new log.json file. It starts from epoch 61, since I resumed training from there, and runs until epoch 100:
[Image: json file info for centernet]

You may notice that the mAP values from the validation dataset are present in the terminal output and in the json file, but they cannot be plotted, which is strange to me. What do you think? Do you have any idea?

Have a look also at the test results for one of the epoch.pth checkpoints (epoch 90): it gives zeros for all mAP values, which does not make sense.
[Image: test _centernet in epoch90]

I am totally confused. Any help, please?

@Akazfu

Akazfu commented Aug 13, 2021

I got the same issue: KeyError: 'work_dirs/yolox_tiny_8x8_300e_elevator/20210812_134209.log.json does not contain metric bbox_mAP'. However, I'm pretty sure the mAP values are included in the json file. Has anyone solved this?😂

@alaa-shubbak
Author

alaa-shubbak commented Aug 13, 2021

@Akazfu
What I noticed from this training is that when you resume from the last epoch (or an older one), the new mAP values are written to a new json file that starts from the resumed epoch, not from the first epoch. This means you need to merge the two json files in order to plot all the mAP values in the same figure. I have not done such a merge yet, and I don't know whether it is really possible.

If anyone has ideas, please share them with us.
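One way the merge described above could be attempted is sketched below. It assumes each log.json file is line-delimited JSON (one JSON object per line) and that every training or validation record carries an `epoch` field; both are assumptions about the log format rather than something confirmed in this thread. The helper name `merge_json_logs` is hypothetical and is not part of any library.

```python
import json


def merge_json_logs(log_paths, out_path):
    """Combine several line-delimited JSON logs into one file ordered by epoch."""
    merged = []
    for path in log_paths:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                record = json.loads(line)
                # Skip header-style lines that carry no 'epoch' field.
                if 'epoch' not in record:
                    continue
                merged.append(record)
    # sorted() is stable, so records within one epoch keep their original order.
    merged = sorted(merged, key=lambda record: record['epoch'])
    with open(out_path, 'w') as f:
        for record in merged:
            f.write(json.dumps(record) + '\n')
    return merged
```

Calling it with the log of the first run followed by the log written after resuming would produce a single file covering all epochs. Whether plotting the merged file avoids the KeyError has not been verified here.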

@Akazfu

Akazfu commented Aug 17, 2021

I have read analyze_logs.py and noticed a metric check at line 53: if metric not in log_dict[epochs[0]]:.
Because of this check, if you have set an evaluation interval i so that validation runs only every i epochs, or your log file is discontinuous (because of resuming), as @alaa-shubbak mentioned, you might encounter the KeyError: 'log.json does not contain metric bbox_mAP'.
To fix this, I added an argument, parser_plt.add_argument('--interval', type=str, default='1'), to the parser and made sure that the x and y used for plotting have the same first dimension.
[Image: 1629165873(1)]
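The screenshot with the actual change is not available in text form, so the following is not that patch. It is a minimal sketch of an alternative way to keep x and y the same length: collect only the epochs that actually logged the metric instead of checking the first epoch alone. It assumes `log_dict` maps each epoch number to a dict of metric name to list of logged values, which is an assumption about the structure used in analyze_logs.py. The function name `metric_series` is hypothetical.

```python
def metric_series(log_dict, metric):
    """Return (epochs, values) for the epochs that actually logged `metric`."""
    epochs, values = [], []
    for epoch in sorted(log_dict):
        recorded = log_dict[epoch].get(metric)
        if recorded:
            # Epochs without a validation entry are skipped, so the two
            # lists always end up with the same length.
            epochs.append(epoch)
            values.append(recorded[-1])
    if not epochs:
        raise KeyError(f'log does not contain metric {metric}')
    return epochs, values
```

The two returned lists can be passed directly to a plotting call as x and y. The KeyError is then raised only when no epoch at all contains the metric.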

AronLin added the bug label on Aug 18, 2021
@AronLin
Contributor

AronLin commented Aug 18, 2021

I will try to check this bug and fix it.
