
Is validation loss computed and output? #310

Closed
tetsu-kikuchi opened this issue Dec 22, 2020 · 12 comments

@tetsu-kikuchi

tetsu-kikuchi commented Dec 22, 2020

Thank you for your great work. I'd like to ask you a small question.
While I can find evaluation scores such as mIoU, I cannot find the validation loss anywhere (on TensorBoard, standard output, log.json, etc.).

  • Is the validation loss not computed?
  • Is it computed but not output by default (so can I output the validation loss somehow by changing the config)?
  • Is it computed and output, but am I simply missing it?

I used the following config.

python tools/train.py  configs/deeplabv3plus/deeplabv3plus_r50-d8_512x1024_80k_cityscapes.py 

I set

workflow = [('train', 10), ('val', 1)]
evaluation = dict(interval=2000, metric='mIoU')

where 1 epoch = 300 iterations.
Thanks for any help.

@rubeea

rubeea commented Dec 22, 2020


The decode head loss and the aux head loss are calculated on the validation dataset. Maybe you can combine them (for example, average or sum them) to get the total loss.
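If it helps, here is a rough sketch of what I mean, assuming the logged keys are decode.loss_seg and aux.loss_seg as in the default configs (the aux value already carries its loss_weight, so a plain sum should mirror how _parse_losses builds the total):

    # Rough sketch (not mmseg code): combine the per-head losses that get logged
    # into a single scalar. The aux head's loss_weight (0.4 by default) is already
    # applied when its loss is computed, so a plain sum gives the total loss.
    def total_loss(log_vars):
        return sum(value for key, value in log_vars.items() if 'loss' in key)

    # Example: total_loss({'decode.loss_seg': 0.21, 'aux.loss_seg': 0.06}) -> 0.27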

@tetsu-kikuchi

tetsu-kikuchi commented Dec 22, 2020

@rubeea
Thank you for your reply. To be honest, I do not understand the details of DeepLab or the meaning of each loss function.

So, do you mean that the validation loss is implicitly computed in the code, but it is not output anywhere (TensorBoard, standard output, log.json, etc.)? Should I modify the code a little so that I can get the value of the validation loss?

The two kinds of losses you mentioned (decode.loss_seg and aux.loss_seg) appear on TensorBoard in the train tab, but I cannot find them in the validation tab. Only evaluation scores (aAcc, mAcc, and mIoU) appear in the validation tab. Possibly I am doing something stupid or misunderstanding something badly.

I am still confused, but it seems that I should first learn the meaning of the losses and their implementation in the code. Thank you again.

@rubeea

rubeea commented Dec 24, 2020


Hi,

Actually, you are right: those are indeed the training losses, while the metrics are computed on the validation dataset. Kindly report the solution here if you find a workaround. Thanks :)

@rubeea

rubeea commented Jan 4, 2021


Hi,

As per the mmsegmentation docs (https://mmsegmentation.readthedocs.io/en/latest/tutorials/customize_runtime.html), the validation loss can be calculated by setting the workflow to

[('train', 1), ('val', 1)] instead of just [('train', 1)]. However, I get a dataloader error when I attempt to set the workflow to [('train', 1), ('val', 1)]. Do you meet a similar error as well? If yes, do you have any idea how it can be resolved?
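For reference, the change recommended by the docs is just this one line in the runtime config (a sketch; I am assuming the default config layout of the repo):

    # e.g. in configs/_base_/default_runtime.py
    workflow = [('train', 1), ('val', 1)]  # the default is [('train', 1)]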

@tetsu-kikuchi

tetsu-kikuchi commented Jan 10, 2021

Hello @rubeea, sorry for the late reply. I have been crazily busy this week.

Thank you for your comment. Yes, I have changed the workflow to include val.
Unfortunately, I have not encountered a similar error; both [('train', 1)] and [('train', 1), ('val', 1)] worked in my case.

For the original problem in this issue, that is, outputting the validation loss to TensorBoard, I found a workaround today.
In mmseg/models/segmentors/base.py, the validation loss is computed in val_step:
https://github.com/open-mmlab/mmsegmentation/blob/master/mmseg/models/segmentors/base.py#L162

To show a loss in TensorBoard, we need the key 'log_vars' in the output dictionary. This key exists in the train output (in train_step) but not in the val output; that is why the val loss is not shown in TensorBoard, I suppose.
So I simply mimicked train_step and added the following after output = self(**data_batch, **kwargs) in val_step:

    # Parse the raw loss dict, just like train_step does.
    loss, log_vars = self._parse_losses(output)

    # Prefix the keys with 'val_' so that the validation losses are
    # distinguishable from the training losses in TensorBoard.
    import collections
    log_vars_val = collections.OrderedDict(
        ('val_' + k, v) for k, v in log_vars.items())

    output = dict(
        loss=loss,
        log_vars=log_vars_val,
        num_samples=len(data_batch['img'].data))

I changed the names by adding the prefix 'val_' to the keys; otherwise, I think the val loss would not be distinguishable from the train loss in TensorBoard. In my case, this workaround worked and the val loss is shown on TensorBoard. (One unsatisfactory point is that the val loss appears in the 'train' tab. This is ugly, but not a problem in practice.)
I hope this helps.
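For completeness, the whole modified val_step in my copy looks roughly like this (a sketch of my local change, not the upstream code):

    def val_step(self, data_batch, **kwargs):
        """The iteration step during validation (modified to log losses)."""
        import collections

        # The forward pass returns a dict of raw losses when return_loss=True.
        output = self(**data_batch, **kwargs)
        loss, log_vars = self._parse_losses(output)

        # Prefix the keys so the validation losses do not collide with the
        # training losses in TensorBoard.
        log_vars_val = collections.OrderedDict(
            ('val_' + k, v) for k, v in log_vars.items())

        return dict(
            loss=loss,
            log_vars=log_vars_val,
            num_samples=len(data_batch['img'].data))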

@tetsu-kikuchi

tetsu-kikuchi commented Jan 11, 2021

I have not yet fully understood how to handle the validation loss, but a workaround to show it is described above, so let me close this issue.
I would appreciate a comment here if you have seen the validation loss shown properly on TensorBoard with the default setup (i.e., by only changing the config files, without modifying the code itself).

@rubeea

rubeea commented Jan 14, 2021


Thank you for your comments and help. I'll definitely post here if I find a way to display the losses in a separate validation tab.

@rubeea

rubeea commented Feb 11, 2021

Hi @tetsu-kikuchi,

Did you encounter this error when using workflow = [('train', 1), ('val', 1)]? If not, can you kindly share your config file here so that I can understand what mistake I am making?

Error:
Traceback (most recent call last)
in ()
31 mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
32 train_segmentor(model, datasets, cfg, distributed=False, validate=True,
---> 33 meta=meta)

/content/pldu_mmsegmentation/mmseg/apis/train.py in train_segmentor(model, dataset, cfg, distributed, validate, timestamp, meta)
114 elif cfg.load_from:
115 runner.load_checkpoint(cfg.load_from)
--> 116 runner.run(data_loaders, cfg.workflow)

/usr/local/lib/python3.6/dist-packages/mmcv/runner/iter_based_runner.py in run(self, data_loaders, workflow, max_iters, **kwargs)
128 if mode == 'train' and self.iter >= self._max_iters:
129 break
--> 130 iter_runner(iter_loaders[i], **kwargs)
131
132 time.sleep(1) # wait for some hooks like loggers to finish

/usr/local/lib/python3.6/dist-packages/mmcv/runner/iter_based_runner.py in val(self, data_loader, **kwargs)
74 self.call_hook('before_val_iter')
75 data_batch = next(data_loader)
---> 76 outputs = self.model.val_step(data_batch, **kwargs)
77 if not isinstance(outputs, dict):
78 raise TypeError('model.val_step() must return a dict')

/usr/local/lib/python3.6/dist-packages/mmcv/parallel/data_parallel.py in val_step(self, *inputs, **kwargs)
87
88 inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
---> 89 return self.module.val_step(*inputs[0], **kwargs[0])

/content/pldu_mmsegmentation/mmseg/models/segmentors/base.py in val_step(self, data_batch, **kwargs)
167 not implemented with this method, but an evaluation hook.
168 """
--> 169 output = self(**data_batch, **kwargs)
170 # loss, log_vars = self._parse_losses(output)
171 # log_vars_val = OrderedDict()

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),

/usr/local/lib/python3.6/dist-packages/mmcv/runner/fp16_utils.py in new_func(*args, **kwargs)
82 'method of nn.Module')
83 if not (hasattr(args[0], 'fp16_enabled') and args[0].fp16_enabled):
---> 84 return old_func(*args, **kwargs)
85 # get the arg spec of the decorated method
86 args_info = getfullargspec(old_func)

/content/pldu_mmsegmentation/mmseg/models/segmentors/base.py in forward(self, img, img_metas, return_loss, **kwargs)
120 """
121 if return_loss:
--> 122 return self.forward_train(img, img_metas, **kwargs)
123 else:
124 return self.forward_test(img, img_metas, **kwargs)

TypeError: forward_train() missing 1 required positional argument: 'gt_semantic_seg'

Thanks in advance for your help.

@tetsu-kikuchi

tetsu-kikuchi commented Feb 11, 2021

Hi @rubeea
I do not encounter the same error, whether or not I set workflow = [('train', 1), ('val', 1)].
My configs are the following. I use DeepLabV3+.

python tools/train.py configs/deeplabv3plus/deeplabv3plus_r50-d8_512x1024_80k_cityscapes.py --load-from checkpoints/deeplabv3plus_r50-d8_512x1024_80k_cityscapes_20200606_114049-f9fb496d.pth

configs/_base_/models/deeplabv3plus_r50-d8.py

norm_cfg = dict(type='BN', requires_grad=True)
model = dict(
    type='EncoderDecoder',
    pretrained='open-mmlab://resnet50_v1c',
    backbone=dict(
        type='ResNetV1c',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        dilations=(1, 1, 2, 4),
        strides=(1, 2, 1, 1),
        norm_cfg=norm_cfg,
        norm_eval=False,
        style='pytorch',
        contract_dilation=True),
    decode_head=dict(
        sampler=dict(type='OHEMPixelSampler', thresh=0.75, min_kept=int(512*1024/8)),
        type='DepthwiseSeparableASPPHead',
        in_channels=2048,
        in_index=3,
        channels=512,
        dilations=(1, 12, 24, 36),
        c1_in_channels=256,
        c1_channels=48,
        dropout_ratio=0.1,
        num_classes=5, ###19,
        norm_cfg=norm_cfg,
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
    auxiliary_head=dict(
        sampler=dict(type='OHEMPixelSampler', thresh=0.8, min_kept=int(512*1024/6)),
        type='FCNHead',
        in_channels=1024,
        in_index=2,
        channels=256,
        num_convs=1,
        concat_input=False,
        dropout_ratio=0.1,
        num_classes=5, ###19,
        norm_cfg=norm_cfg,
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)))

train_cfg = dict()
test_cfg = dict(mode='whole')

configs/_base_/schedules/schedule_80k.py

optimizer = dict(type='SGD', lr=0.01*(2/8)/5, momentum=0.9, weight_decay=0.0005)

optimizer_config = dict()

lr_config = dict(policy='poly', power=0.9, min_lr=2.*1e-5, by_epoch=False)

runner = dict(type='IterBasedRunner', max_iters=200000)

checkpoint_config = dict(by_epoch=False, interval=5000)
evaluation = dict(interval=5000, metric='mIoU')

configs/_base_/default_runtime.py

log_config = dict(
    interval=400,
    hooks=[
        dict(type='TextLoggerHook', by_epoch=False),
        dict(type='TensorboardLoggerHook')   
    ])

dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
#workflow = [('train', 1)]
workflow = [('train', 1), ('val', 1)]   ###
cudnn_benchmark =  True

configs/_base_/datasets/cityscapes.py

dataset_type = 'CityscapesDataset'
data_root = 'data/cityscapes'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (512, 1024)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='Resize', img_scale=(1024, 512), keep_ratio=False, ratio_range=(1.0, 1.2), interpolation='lanczos'),
    dict(type='RandomCrop', crop_size=crop_size, cat_max_ratio=1.0),
    dict(type='RandomFlip', prob=0.5),
    dict(type='myRotate', rot_prob=0.5, degree_range=[-2,2]),
    dict(type='RandomFlip', prob=0.5),
    dict(type='PhotoMetricDistortion', brightness_delta=0),
    dict(type='Normalize', lightness_standarization=True, **img_norm_cfg),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg']),
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1024,512),
        flip=False,
        transforms=[
            dict(type='Resize', img_scale=(1024,512), multiscale_mode='value', keep_ratio=False, interpolation='lanczos'),  
            dict(type='RandomFlip'),
            dict(type='Normalize', lightness_standarization=True, **img_norm_cfg),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
data = dict(
    samples_per_gpu=2, 
    workers_per_gpu=1, 
    train=dict(
        type=dataset_type,
        data_root=data_root,
        img_dir='leftImg8bit/train',
        ann_dir='gtFine/train',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        data_root=data_root,
        img_dir='leftImg8bit/val',
        ann_dir='gtFine/val',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        data_root=data_root,
        img_dir='leftImg8bit/val',
        ann_dir='gtFine/val',
        pipeline=test_pipeline))

Note that I have customized some of the code.

By the way, your error message indicates that 'gt_semantic_seg' is not properly read.
Please put print(data_batch) in val_step in mmseg/models/segmentors/base.py and check whether the key 'gt_semantic_seg' is in data_batch. In my case, the key 'gt_semantic_seg' is present.
I wonder whether your dataset or dataloader is problematic.
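Something like this, just as a temporary debug print (a sketch, not the upstream code):

    def val_step(self, data_batch, **kwargs):
        # Temporary debugging: with the train pipeline, data_batch should
        # contain 'img', 'img_metas' and 'gt_semantic_seg'.
        print(sorted(data_batch.keys()))
        output = self(**data_batch, **kwargs)
        return output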

@rubeea

rubeea commented Feb 12, 2021

Hey @tetsu-kikuchi
Yes, I already printed data_batch and realized that the 'gt_semantic_seg' key is not in it, but I believe that is because the validation pipeline does not contain the dict(type='LoadAnnotations') operation that the train pipeline has, and that is why no annotations are loaded. May I know which functions you use to build the datasets and train the model? I am using

datasets = [build_dataset(cfg.data.train), build_dataset(cfg.data.val)]

for building the datasets for workflow = [('train', 1), ('val', 1)], and

train_segmentor in mmseg/apis/train.py for training as follows:

train_segmentor(model, datasets, cfg, distributed=False, validate=True, 
                meta=meta)

Moreover, if validate=True in the above function and the workflow is set to [('train', 1)] only, I believe the statistics (loss, accuracy, etc.) are calculated on the validation dataset and not on the train dataset. Is that correct? Because if that is the case, I don't think there is a need to change the workflow to [('train', 1), ('val', 1)].

Thanks in advance.

@tetsu-kikuchi

tetsu-kikuchi commented Feb 14, 2021

Hi @rubeea
For both building the datasets and training the model, I use the default code:

https://github.com/open-mmlab/mmsegmentation/blob/master/tools/train.py#L137

    datasets = [build_dataset(cfg.data.train)]
    if len(cfg.workflow) == 2:
        val_dataset = copy.deepcopy(cfg.data.val)
        val_dataset.pipeline = cfg.data.train.pipeline
        datasets.append(build_dataset(val_dataset))

https://github.com/open-mmlab/mmsegmentation/blob/master/tools/train.py#L152

    train_segmentor(
        model,
        datasets,
        cfg,
        distributed=distributed,
        validate=(not args.no_validate),
        timestamp=timestamp,
        meta=meta)

(To be precise, I downloaded the code last December and use it with slight customizations of my own.)

Moreover, if validate=True in the above function and the workflow is set to [('train', 1)] only, I believe the statistics (loss, accuracy, etc.) are calculated on the validation dataset and not on the train dataset. Is that correct?

Sorry, I rarely set workflow = [('train', 1)] only, so I do not know much about this point.

@rubeea

rubeea commented Feb 14, 2021

Hi @tetsu-kikuchi,

Thanks for your response. As suspected, you are using the train pipeline for the validation dataset as well,
val_dataset.pipeline = cfg.data.train.pipeline
and that is why you have the 'gt_semantic_seg' key in your data batch. However, I have noticed that setting workflow = [('train', 1)] with the validate=True option and
val_dataset.pipeline = cfg.data.val.pipeline
indeed runs the evaluation hook on the validation dataset throughout the training schedule. So I guess it is the same.
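To summarize, the two options look roughly like this (a sketch built around build_dataset and the code in tools/train.py, not verbatim upstream code):

    import copy
    from mmseg.datasets import build_dataset

    # Option (a): workflow = [('train', 1), ('val', 1)].
    # Reuse the training pipeline so 'gt_semantic_seg' is loaded and
    # val_step can compute validation losses.
    val_dataset = copy.deepcopy(cfg.data.val)
    val_dataset.pipeline = cfg.data.train.pipeline
    datasets = [build_dataset(cfg.data.train), build_dataset(val_dataset)]

    # Option (b): workflow = [('train', 1)] with validate=True.
    # Keep cfg.data.val.pipeline (no LoadAnnotations) and rely on the
    # EvalHook to compute mIoU/aAcc on the validation set.
    datasets = [build_dataset(cfg.data.train)]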
Thanks for providing the explanations.
