
Is validation loss computed and output? #310

Closed
tetsu-kikuchi opened this issue Dec 22, 2020 · 12 comments

@tetsu-kikuchi

tetsu-kikuchi commented Dec 22, 2020

Thank you for your great work. I'd like to ask you a small question.
While I can find evaluation scores such as mIoU, I cannot find the validation loss anywhere (on TensorBoard, standard output, log.json, etc.).

  • Is the validation loss not computed?
  • Is it computed but not output by default (so can I output the validation loss somehow by changing the config)?
  • Is it computed and output, but am I simply missing it?

I used the following config.

python tools/train.py  configs/deeplabv3plus/deeplabv3plus_r50-d8_512x1024_80k_cityscapes.py 

I set

workflow = [('train', 10), ('val', 1)]
evaluation = dict(interval=2000, metric='mIoU')

where 1 epoch = 300 iterations.
Thanks for any help.

@rubeea

rubeea commented Dec 22, 2020


The decode head loss and the aux head loss are calculated on the validation dataset. Maybe you can combine them (for example, average or sum them) to get the total loss.
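If it helps, here is a rough sketch of what I mean, assuming the logged keys are decode.loss_seg and aux.loss_seg as in the default configs (the aux value already carries its loss_weight, so a plain sum should mirror how _parse_losses builds the total):

    # Rough sketch (not mmseg code): combine the per-head losses that get logged
    # into a single scalar. The aux head's loss_weight (0.4 by default) is already
    # applied when its loss is computed, so a plain sum gives the total loss.
    def total_loss(log_vars):
        return sum(value for key, value in log_vars.items() if 'loss' in key)

    # Example: total_loss({'decode.loss_seg': 0.21, 'aux.loss_seg': 0.06}) -> 0.27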

@tetsu-kikuchi

tetsu-kikuchi commented Dec 22, 2020

@rubeea
Thank you for your reply. To be honest, I do not understand the details of DeepLab or the meaning of each loss function.

So, do you mean that the validation loss is implicitly computed in the code, but it is not output anywhere (TensorBoard, standard output, log.json, etc.)? Should I modify the code a little so that I can get the value of the validation loss?

The two kinds of losses you mentioned (decode.loss_seg and aux.loss_seg) appear on TensorBoard in the train tab, but I cannot find them in the validation tab. Only evaluation scores (aAcc, mAcc, and mIoU) appear in the validation tab. Possibly I am doing something stupid or misunderstanding something badly.

I am still confused, but it seems that I should first learn the meaning of the losses and their implementation in the code. Thank you again.

@rubeea

rubeea commented Dec 24, 2020


Hi,

Actually, you are right: those are indeed the training losses, while the metrics are computed on the validation dataset. Kindly report the solution here if you find a workaround. Thanks :)

@rubeea

rubeea commented Jan 4, 2021


Hi,

As per the mmsegmentation docs (https://mmsegmentation.readthedocs.io/en/latest/tutorials/customize_runtime.html), the validation loss can be calculated by setting the workflow to

[('train', 1), ('val', 1)] instead of just [('train', 1)]. However, I get a dataloader error when I attempt to set the workflow to [('train', 1), ('val', 1)]. Do you meet a similar error as well? If yes, do you have any idea how it can be resolved?
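For reference, the change recommended by the docs is just this one line in the runtime config (a sketch; I am assuming the default config layout of the repo):

    # e.g. in configs/_base_/default_runtime.py
    workflow = [('train', 1), ('val', 1)]  # the default is [('train', 1)]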

@tetsu-kikuchi

tetsu-kikuchi commented Jan 10, 2021

Hello @rubeea, sorry for the late reply. I have been crazily busy this week.

Thank you for your comment. Yes, I have changed the workflow to include val.
Unfortunately, I have not encountered a similar error; both [('train', 1)] and [('train', 1), ('val', 1)] worked in my case.

For the original problem in this issue, that is, outputting the validation loss to TensorBoard, I found a workaround today.
In mmseg/models/segmentors/base.py, the validation loss is computed in val_step:
https://github.com/open-mmlab/mmsegmentation/blob/master/mmseg/models/segmentors/base.py#L162

To show a loss in TensorBoard, we need the key 'log_vars' in the output dictionary. This key exists in the train output (in train_step) but not in the val output; that is why the val loss is not shown in TensorBoard, I suppose.
So I simply mimicked train_step and added the following after output = self(**data_batch, **kwargs) in val_step:

    # Parse the raw loss dict, just like train_step does.
    loss, log_vars = self._parse_losses(output)

    # Prefix the keys with 'val_' so that the validation losses are
    # distinguishable from the training losses in TensorBoard.
    import collections
    log_vars_val = collections.OrderedDict(
        ('val_' + k, v) for k, v in log_vars.items())

    output = dict(
        loss=loss,
        log_vars=log_vars_val,
        num_samples=len(data_batch['img'].data))

I changed the names by adding the prefix 'val_' to the keys; otherwise, I think the val loss would not be distinguishable from the train loss in TensorBoard. In my case, this workaround worked and the val loss is shown on TensorBoard. (One unsatisfactory point is that the val loss appears in the 'train' tab. This is ugly, but not a problem in practice.)
I hope this helps.
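For completeness, the whole modified val_step in my copy looks roughly like this (a sketch of my local change, not the upstream code):

    def val_step(self, data_batch, **kwargs):
        """The iteration step during validation (modified to log losses)."""
        import collections

        # The forward pass returns a dict of raw losses when return_loss=True.
        output = self(**data_batch, **kwargs)
        loss, log_vars = self._parse_losses(output)

        # Prefix the keys so the validation losses do not collide with the
        # training losses in TensorBoard.
        log_vars_val = collections.OrderedDict(
            ('val_' + k, v) for k, v in log_vars.items())

        return dict(
            loss=loss,
            log_vars=log_vars_val,
            num_samples=len(data_batch['img'].data))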

@tetsu-kikuchi

tetsu-kikuchi commented Jan 11, 2021

I have not yet fully understood how to handle the validation loss, but a workaround to show it is described above, so let me close this issue.
I would appreciate a comment here if you have seen the validation loss shown properly on TensorBoard with the default setup (i.e., by only changing the config files, without modifying the code itself).

@rubeea

rubeea commented Jan 14, 2021


Thank you for your comments and help. I'll definitely post here if I find a way to display the losses in a separate validation tab.

@rubeea

rubeea commented Feb 11, 2021

Hi @tetsu-kikuchi,

Did you encounter this error when using workflow = [('train', 1), ('val', 1)]? If not, can you kindly share your config file here so that I can understand what mistake I am making?

Error:
Traceback (most recent call last)
in ()
31 mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
32 train_segmentor(model, datasets, cfg, distributed=False, validate=True,
---> 33 meta=meta)

/content/pldu_mmsegmentation/mmseg/apis/train.py in train_segmentor(model, dataset, cfg, distributed, validate, timestamp, meta)
114 elif cfg.load_from:
115 runner.load_checkpoint(cfg.load_from)
--> 116 runner.run(data_loaders, cfg.workflow)

/usr/local/lib/python3.6/dist-packages/mmcv/runner/iter_based_runner.py in run(self, data_loaders, workflow, max_iters, **kwargs)
128 if mode == 'train' and self.iter >= self._max_iters:
129 break
--> 130 iter_runner(iter_loaders[i], **kwargs)
131
132 time.sleep(1) # wait for some hooks like loggers to finish

/usr/local/lib/python3.6/dist-packages/mmcv/runner/iter_based_runner.py in val(self, data_loader, **kwargs)
74 self.call_hook('before_val_iter')
75 data_batch = next(data_loader)
---> 76 outputs = self.model.val_step(data_batch, **kwargs)
77 if not isinstance(outputs, dict):
78 raise TypeError('model.val_step() must return a dict')

/usr/local/lib/python3.6/dist-packages/mmcv/parallel/data_parallel.py in val_step(self, *inputs, **kwargs)
87
88 inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
---> 89 return self.module.val_step(*inputs[0], **kwargs[0])

/content/pldu_mmsegmentation/mmseg/models/segmentors/base.py in val_step(self, data_batch, **kwargs)
167 not implemented with this method, but an evaluation hook.
168 """
--> 169 output = self(**data_batch, **kwargs)
170 # loss, log_vars = self._parse_losses(output)
171 # log_vars_val = OrderedDict()

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),

/usr/local/lib/python3.6/dist-packages/mmcv/runner/fp16_utils.py in new_func(*args, **kwargs)
82 'method of nn.Module')
83 if not (hasattr(args[0], 'fp16_enabled') and args[0].fp16_enabled):
---> 84 return old_func(*args, **kwargs)
85 # get the arg spec of the decorated method
86 args_info = getfullargspec(old_func)

/content/pldu_mmsegmentation/mmseg/models/segmentors/base.py in forward(self, img, img_metas, return_loss, **kwargs)
120 """
121 if return_loss:
--> 122 return self.forward_train(img, img_metas, **kwargs)
123 else:
124 return self.forward_test(img, img_metas, **kwargs)

TypeError: forward_train() missing 1 required positional argument: 'gt_semantic_seg'

Thanks in advance for your help.

@tetsu-kikuchi

tetsu-kikuchi commented Feb 11, 2021

Hi @rubeea
I do not encounter the same error, whether or not I set workflow = [('train', 1), ('val', 1)].
My configs are the following. I use DeepLabV3+.

python tools/train.py configs/deeplabv3plus/deeplabv3plus_r50-d8_512x1024_80k_cityscapes.py --load-from checkpoints/deeplabv3plus_r50-d8_512x1024_80k_cityscapes_20200606_114049-f9fb496d.pth

configs/_base_/models/deeplabv3plus_r50-d8.py

norm_cfg = dict(type='BN', requires_grad=True)
model = dict(
    type='EncoderDecoder',
    pretrained='open-mmlab://resnet50_v1c',
    backbone=dict(
        type='ResNetV1c',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        dilations=(1, 1, 2, 4),
        strides=(1, 2, 1, 1),
        norm_cfg=norm_cfg,
        norm_eval=False,
        style='pytorch',
        contract_dilation=True),
    decode_head=dict(
        sampler=dict(type='OHEMPixelSampler', thresh=0.75, min_kept=int(512*1024/8)),
        type='DepthwiseSeparableASPPHead',
        in_channels=2048,
        in_index=3,
        channels=512,
        dilations=(1, 12, 24, 36),
        c1_in_channels=256,
        c1_channels=48,
        dropout_ratio=0.1,
        num_classes=5, ###19,
        norm_cfg=norm_cfg,
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
    auxiliary_head=dict(
        sampler=dict(type='OHEMPixelSampler', thresh=0.8, min_kept=int(512*1024/6)),
        type='FCNHead',
        in_channels=1024,
        in_index=2,
        channels=256,
        num_convs=1,
        concat_input=False,
        dropout_ratio=0.1,
        num_classes=5, ###19,
        norm_cfg=norm_cfg,
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)))

train_cfg = dict()
test_cfg = dict(mode='whole')

configs/_base_/schedules/schedule_80k.py

optimizer = dict(type='SGD', lr=0.01*(2/8)/5, momentum=0.9, weight_decay=0.0005)

optimizer_config = dict()

lr_config = dict(policy='poly', power=0.9, min_lr=2.*1e-5, by_epoch=False)

runner = dict(type='IterBasedRunner', max_iters=200000)

checkpoint_config = dict(by_epoch=False, interval=5000)
evaluation = dict(interval=5000, metric='mIoU')

configs/_base_/default_runtime.py

log_config = dict(
    interval=400,
    hooks=[
        dict(type='TextLoggerHook', by_epoch=False),
        dict(type='TensorboardLoggerHook')   
    ])

dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
#workflow = [('train', 1)]
workflow = [('train', 1), ('val', 1)]   ###
cudnn_benchmark =  True

configs/_base_/datasets/cityscapes.py

dataset_type = 'CityscapesDataset'
data_root = 'data/cityscapes'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (512, 1024)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='Resize', img_scale=(1024, 512), keep_ratio=False, ratio_range=(1.0, 1.2), interpolation='lanczos'),
    dict(type='RandomCrop', crop_size=crop_size, cat_max_ratio=1.0),
    dict(type='RandomFlip', prob=0.5),
    dict(type='myRotate', rot_prob=0.5, degree_range=[-2,2]),
    dict(type='RandomFlip', prob=0.5),
    dict(type='PhotoMetricDistortion', brightness_delta=0),
    dict(type='Normalize', lightness_standarization=True, **img_norm_cfg),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg']),
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1024,512),
        flip=False,
        transforms=[
            dict(type='Resize', img_scale=(1024,512), multiscale_mode='value', keep_ratio=False, interpolation='lanczos'),  
            dict(type='RandomFlip'),
            dict(type='Normalize', lightness_standarization=True, **img_norm_cfg),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
data = dict(
    samples_per_gpu=2, 
    workers_per_gpu=1, 
    train=dict(
        type=dataset_type,
        data_root=data_root,
        img_dir='leftImg8bit/train',
        ann_dir='gtFine/train',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        data_root=data_root,
        img_dir='leftImg8bit/val',
        ann_dir='gtFine/val',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        data_root=data_root,
        img_dir='leftImg8bit/val',
        ann_dir='gtFine/val',
        pipeline=test_pipeline))

Note that I have customized some of the code.

By the way, your error message indicates that 'gt_semantic_seg' is not properly read.
Please put print(data_batch) in val_step in mmseg/models/segmentors/base.py and check whether the key 'gt_semantic_seg' is in data_batch. In my case, the key 'gt_semantic_seg' is present.
I wonder whether your dataset or dataloader is problematic.
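Something like this, just as a temporary debug print (a sketch, not the upstream code):

    def val_step(self, data_batch, **kwargs):
        # Temporary debugging: with the train pipeline, data_batch should
        # contain 'img', 'img_metas' and 'gt_semantic_seg'.
        print(sorted(data_batch.keys()))
        output = self(**data_batch, **kwargs)
        return output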

@rubeea

rubeea commented Feb 12, 2021

Hey @tetsu-kikuchi
Yes, I already printed data_batch and realized that the 'gt_semantic_seg' key is not in it, but I believe that is because the validation pipeline does not contain the dict(type='LoadAnnotations') operation that the train pipeline has, and that is why no annotations are loaded. May I know which functions you use to build the datasets and train the model? I am using

datasets = [build_dataset(cfg.data.train), build_dataset(cfg.data.val)]

for building the datasets for workflow = [('train', 1), ('val', 1)], and

train_segmentor in mmseg/apis/train.py for training as follows:

train_segmentor(model, datasets, cfg, distributed=False, validate=True, 
                meta=meta)

Moreover, if validate=True in the above function and the workflow is set to [('train', 1)] only, I believe the statistics (loss, accuracy, etc.) are calculated on the validation dataset and not on the train dataset. Is that correct? Because if that is the case, I don't think there is a need to change the workflow to [('train', 1), ('val', 1)].

Thanks in advance.

@tetsu-kikuchi

tetsu-kikuchi commented Feb 14, 2021

Hi @rubeea
For both building the datasets and training the model, I use the default code:

https://github.com/open-mmlab/mmsegmentation/blob/master/tools/train.py#L137

    datasets = [build_dataset(cfg.data.train)]
    if len(cfg.workflow) == 2:
        val_dataset = copy.deepcopy(cfg.data.val)
        val_dataset.pipeline = cfg.data.train.pipeline
        datasets.append(build_dataset(val_dataset))

https://github.com/open-mmlab/mmsegmentation/blob/master/tools/train.py#L152

    train_segmentor(
        model,
        datasets,
        cfg,
        distributed=distributed,
        validate=(not args.no_validate),
        timestamp=timestamp,
        meta=meta)

(To be precise, I downloaded the code last December and use it with slight customizations of my own.)

Moreover, if validate=True in the above function and the workflow is set to [('train', 1)] only, I believe the statistics (loss, accuracy, etc.) are calculated on the validation dataset and not on the train dataset. Is that correct?

Sorry, I rarely set workflow = [('train', 1)] only, so I do not know much about this point.

@rubeea

rubeea commented Feb 14, 2021

Hi @tetsu-kikuchi,

Thanks for your response. As suspected, you are using the train pipeline for the validation dataset as well,
val_dataset.pipeline = cfg.data.train.pipeline
and that is why you have the 'gt_semantic_seg' key in your data batch. However, I have noticed that setting workflow = [('train', 1)] with the validate=True option and
val_dataset.pipeline = cfg.data.val.pipeline
indeed runs the evaluation hook on the validation dataset throughout the training schedule. So I guess it is the same.
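To summarize, the two options look roughly like this (a sketch built around build_dataset and the code in tools/train.py, not verbatim upstream code):

    import copy
    from mmseg.datasets import build_dataset

    # Option (a): workflow = [('train', 1), ('val', 1)].
    # Reuse the training pipeline so 'gt_semantic_seg' is loaded and
    # val_step can compute validation losses.
    val_dataset = copy.deepcopy(cfg.data.val)
    val_dataset.pipeline = cfg.data.train.pipeline
    datasets = [build_dataset(cfg.data.train), build_dataset(val_dataset)]

    # Option (b): workflow = [('train', 1)] with validate=True.
    # Keep cfg.data.val.pipeline (no LoadAnnotations) and rely on the
    # EvalHook to compute mIoU/aAcc on the validation set.
    datasets = [build_dataset(cfg.data.train)]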
Thanks for providing the explanations.
