use the resume function in training but not save new mAP values #5860

alaa-shubbak · 2021-08-11T07:30:47Z

Dear all,
I am trying to use the resume function in training mode because I want more epochs than I originally ran.
I first trained Faster R-CNN for only 60 epochs, then plotted the performance and found the mAP values were not good enough, so I went back to the model and ran the following command to resume from the latest epoch:

python tools/train.py configs/human/my_custom_config.py --gpus 1 --work-dir training_data/faster_rcnn_epoch60 --resume-from training_data/faster_rcnn_epoch60/latest.pth
My problem is that when I resume, more epoch checkpoints are saved on my machine, but the validation mAP values for those epochs are not saved. I need those values to evaluate the training process and choose the epoch with the highest mAP, so I can use its weights for testing.

When I try to plot the performance with the new (.json) file using this command:

python tools/analysis_tools/analyze_logs.py plot_curve training_data/faster_rcnn_epoch60/20210811_012551.log.json --keys bbox_mAP --legend mAP_bbox

I get this error message:

KeyError: 'training_data/faster_rcnn_epoch60/20210811_012551.log.json does not contain metric bbox_mAP'

This is strange, because I can plot the performance using the JSON file from the first training run, but not the one from the resumed run.
I did not change anything in the config file except the max_epoch value, which is now 120 instead of 60.

Any help, please?

bommap2810 · 2021-08-11T07:41:39Z

In my opinion, if you want to resume the training process, just edit the config file: change the line resume_from = None to resume_from = 'path_to_your_work_dirs/configs/last_epoch.pth'. I have resumed this way before and it worked.

alaa-shubbak · 2021-08-11T07:48:33Z

Okay, but I have one question:
if you do not change anything else in the config file, how do you tell the system at which epoch training should stop?

alaa-shubbak · 2021-08-11T07:52:42Z

@bommap2810

Could you please write out the full command you run for that on Linux,
like this one:

python tools/analysis_tools/analyze_logs.py plot_curve training_data/faster_rcnn_epoch60/20210811_012551.log.json --keys bbox_mAP --legend mAP_bbox

Or do you write this resume_from in the config file?

bommap2810 · 2021-08-11T07:56:57Z

I write resume_from in the config file and run the training command again:
python tools/train.py path_to_your_config/config.py

alaa-shubbak · 2021-08-11T07:59:24Z

Many thanks, I will try it.
Did you change the number of epochs or not?

bommap2810 · 2021-08-11T08:00:55Z

Yes, I did change the number of epochs.
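For readers following this exchange, a minimal sketch of what the two config changes might look like together. This is an assumption-laden fragment, not the poster's actual config: the checkpoint path is the one from the original command, and the key names `runner`/`max_epochs` and `evaluation` follow the MMDetection 2.x default schedule files (older configs may use `total_epochs` instead, so check your own version).

```python
# Fragment of an MMDetection-style Python config (illustrative values only).

# Checkpoint to continue from; path taken from the command in the first post.
resume_from = 'training_data/faster_rcnn_epoch60/latest.pth'

# Assumed key name: raising max_epochs is what lets training run past epoch 60.
runner = dict(type='EpochBasedRunner', max_epochs=120)

# Assumed setting: run validation every epoch so bbox_mAP is logged each time.
evaluation = dict(interval=1, metric='bbox')
```

With these set in the config, the training command needs no --resume-from flag.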

alaa-shubbak · 2021-08-11T08:02:34Z

Great, now I get it. Many thanks, I will try it and see.
Thank you so much for the reply and the help.

alaa-shubbak · 2021-08-11T22:08:27Z

@bommap2810
Hello,
Unfortunately, I still have the same problem.
Training continues from the latest epoch and saves the bbox_mAP values (evaluation metric) in a new log.json file,
but when I try to plot the curve of those mAP values versus epoch, it gives me the following error.

This is also the new log.json file. It starts from epoch 61, because I resumed the training from there, and runs until epoch 100.

You may notice that mAP values from the validation dataset are present both in the terminal output and in the JSON file, but they cannot be plotted, which is strange to me. What do you think? Do you have any idea?

Also have a look at the testing results for one of the epoch.pth checkpoints (epoch 90): it gives zeros for all mAP values, which does not make sense.

I am totally confused. Any help, please?

Akazfu · 2021-08-13T03:17:15Z

I got the same issue: ‘KeyError: 'work_dirs/yolox_tiny_8x8_300e_elevator/20210812_134209.log.json does not contain metric bbox_mAP'’. However, I'm pretty sure the mAP values are included in the JSON file. Has anyone solved this? 😂

alaa-shubbak · 2021-08-13T09:48:15Z

@Akazfu
What I noticed is that when you resume training from the last epoch (or an older one), the new mAP values are written to a new JSON file that starts from the resumed epoch, not from epoch 1. That means you need to merge the two JSON files in order to plot all the mAP values in the same figure. I have not tried such a merge yet, and I don't know if it is really possible.

If anyone has any ideas, please share them with us.
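A rough sketch of the merge idea described above, reading the metric directly from both log files instead of going through analyze_logs.py. It assumes each log.json holds one JSON object per line and that validation records carry an "epoch" key next to the metric; the function name `collect_metric` is hypothetical and not part of MMDetection.

```python
import json
from pathlib import Path


def collect_metric(log_paths, metric="bbox_mAP"):
    """Return {epoch: value} for `metric`, merged across several log files."""
    values = {}
    for path in log_paths:
        for line in Path(path).read_text().splitlines():
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Skip header lines and training records that lack the metric.
            if "epoch" not in record or metric not in record:
                continue
            # If an epoch appears in two files, the later file wins.
            values[record["epoch"]] = record[metric]
    # Sort by epoch so the result can be plotted directly.
    return dict(sorted(values.items()))
```

Passing the first run's log and the resumed run's log, in that order, gives one epoch-to-mAP mapping; its keys and values can then be handed to any plotting library as x and y.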

Akazfu · 2021-08-17T02:06:55Z

I read analyze_logs.py and noticed a metric check at line 53: if metric not in log_dict[epochs[0]]:.
If you have set an evaluation interval i, so that validation runs only every i epochs, or your log file is discontinuous (because of resuming), as @alaa-shubbak mentioned, you might encounter KeyError: 'log.json does not contain metric bbox_mAP'.
To fix this, I added an argument, parser_plt.add_argument('--interval', type=str, default='1'), to the parser and made sure that the x and y passed to the plot have the same first dimension.
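One way to keep x and y the same length, sketched under assumptions rather than copied from the actual patch: build both lists only from epochs that really recorded the metric, instead of checking the first epoch alone. The helper name `metric_points` is hypothetical, and the shape of `log_dict` (epoch mapped to a dict of metric name to list of values) is assumed from the check quoted above.

```python
def metric_points(log_dict, metric):
    """Return (xs, ys) using only epochs that actually recorded `metric`."""
    xs, ys = [], []
    for epoch in sorted(log_dict):
        vals = log_dict[epoch].get(metric)
        if not vals:
            # Epoch was skipped by the evaluation interval, or precedes a resume.
            continue
        xs.append(epoch)
        ys.append(vals[-1])  # keep the last value logged for that epoch
    if not xs:
        raise KeyError(f"no epoch contains metric {metric}")
    return xs, ys
```

Because xs is taken from the epochs themselves, a log that starts at epoch 61 or evaluates every second epoch still yields matching lists.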

AronLin · 2021-08-18T03:17:05Z

I will try to check this bug and fix it.

openmmlab-bot assigned AronLin Aug 11, 2021

AronLin added the bug label Aug 18, 2021

AronLin added the community help wanted label Aug 18, 2021

ZwwWayne closed this as completed Aug 24, 2021

This was referenced Mar 15, 2022

Enhance the robustness of analyze_logs.py #7399

Closed

Enhance the robustness of analyze_logs.py #7407

Merged
