update MaskFormer readme and docs #7241

Merged 9 commits on Feb 24, 2022.

Changes from all commits

50 changes: 21 additions & 29 deletions configs/maskformer/README.md
@@ -1,37 +1,18 @@
# MaskFormer

> [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278)

<!-- [ALGORITHM] -->

## Abstract

Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.

<div align=center>
<img src="https://camo.githubusercontent.com/29fb22298d506ce176caad3006a7b05ef2603ca12cece6c788b7e73c046e8bc9/68747470733a2f2f626f77656e63303232312e6769746875622e696f2f696d616765732f6d61736b666f726d65722e706e67" height="300"/>
</div>
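
To make the mask-classification idea concrete, here is a small sketch of the paper's semantic inference step, which marginalizes the per-query binary masks over the per-query class probabilities. All tensor names and sizes below are illustrative, not part of the mmdetection API.

```python
import torch

# Illustrative shapes only: 100 queries, 133 COCO panoptic classes.
num_queries, num_classes, h, w = 100, 133, 64, 64
mask_cls = torch.randn(num_queries, num_classes + 1)  # +1 "no object" slot
mask_pred = torch.randn(num_queries, h, w)            # mask logits

cls_prob = mask_cls.softmax(-1)[..., :-1]             # drop "no object"
mask_prob = mask_pred.sigmoid()
# weight every binary mask by its class probabilities, sum over queries
semseg = torch.einsum('qc,qhw->chw', cls_prob, mask_prob)
print(semseg.argmax(0).shape)                         # (h, w) class map
```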

## Introduction

MaskFormer requires the COCO and [COCO-panoptic](http://images.cocodataset.org/annotations/panoptic_annotations_trainval2017.zip) datasets for training and evaluation. You need to download and extract them in the COCO dataset path.
The directory structure should look like this.
@@ -55,6 +36,17 @@ mmdetection
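
As a convenience, the download-and-extract step can be scripted with the standard library alone. This is a minimal sketch; `data/coco` is a hypothetical dataset root, so adjust the paths to your own layout (the expected tree is in the collapsed section above).

```python
import urllib.request
import zipfile
from pathlib import Path

# `data/coco` is a hypothetical dataset root; change it to your layout.
data_root = Path('data/coco')
url = ('http://images.cocodataset.org/annotations/'
       'panoptic_annotations_trainval2017.zip')
archive = data_root / 'panoptic_annotations_trainval2017.zip'

data_root.mkdir(parents=True, exist_ok=True)
if not archive.exists():
    urllib.request.urlretrieve(url, archive)  # large download
with zipfile.ZipFile(archive) as zf:
    zf.extractall(data_root)                  # unpacks the panoptic annotations
```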

## Results and Models

| Backbone | style | Lr schd | Mem (GB) | Inf time (fps) | PQ | SQ | RQ | PQ_th | SQ_th | RQ_th | PQ_st | SQ_st | RQ_st | Config | Download | detail |
| :------: | :-----: | :-----: | :------: | :------------: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :------: | :----: |
| R-50 | pytorch | 75e | 16.6 | - | 46.854 | 80.617 | 57.085 | 51.089 | 81.511 | 61.853 | 40.463 | 79.269 | 49.888 | [config](https://download.openmmlab.com/mmdetection/v2.0/maskformer/maskformer_r50_mstrain_16x1_75e_coco/maskformer_r50_mstrain_16x1_75e_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/maskformer/maskformer_r50_mstrain_16x1_75e_coco/maskformer_r50_mstrain_16x1_75e_coco_20220221_141956-bc2699cb.pth) &#124; [log](https://download.openmmlab.com/mmdetection/v2.0/maskformer/maskformer_r50_mstrain_16x1_75e_coco/maskformer_r50_mstrain_16x1_75e_coco_20220221_141956.log.json) | This version was mentioned in Table XI of the paper [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) |
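
For a quick smoke test of the released checkpoint, something like the following should work on mmdetection v2.x; the config and checkpoint paths come from the table above, while `demo/demo.jpg` is the sample image shipped with the repo. Treat this as a sketch: the result layout for panoptic models differs from box detectors, so consult the docs of your mmdet version before indexing into `result`.

```python
from mmdet.apis import init_detector, inference_detector

config = 'configs/maskformer/maskformer_r50_mstrain_16x1_75e_coco.py'
checkpoint = ('https://download.openmmlab.com/mmdetection/v2.0/maskformer/'
              'maskformer_r50_mstrain_16x1_75e_coco/'
              'maskformer_r50_mstrain_16x1_75e_coco_20220221_141956-bc2699cb.pth')

model = init_detector(config, checkpoint, device='cuda:0')
result = inference_detector(model, 'demo/demo.jpg')
model.show_result('demo/demo.jpg', result, out_file='demo_panoptic.jpg')
```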

## Citation

```latex
@inproceedings{cheng2021maskformer,
title={Per-Pixel Classification is Not All You Need for Semantic Segmentation},
author={Bowen Cheng and Alexander G. Schwing and Alexander Kirillov},
booktitle={NeurIPS},
year={2021}
}
```
6 changes: 3 additions & 3 deletions mmdet/core/bbox/assigners/mask_hungarian_assigner.py
@@ -29,9 +29,9 @@ class MaskHungarianAssigner(BaseAssigner):
- positive integer: positive sample, index (1-based) of assigned gt

Args:
cls_cost (:obj:`mmcv.ConfigDict` | dict): Classification cost config.
mask_cost (:obj:`mmcv.ConfigDict` | dict): Mask cost config.
dice_cost (:obj:`mmcv.ConfigDict` | dict): Dice cost config.
"""

def __init__(self,
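
For intuition, the assignment this class performs boils down to building a (num_queries, num_gt) cost matrix from those three terms and solving it with the Hungarian algorithm. A simplified sketch, not the mmdet implementation (the L1 mask cost below is a stand-in for the configurable cost modules, and the default weights are assumptions):

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(cls_scores, mask_preds, gt_labels, gt_masks,
                    cls_weight=1.0, mask_weight=20.0, dice_weight=1.0):
    # cls_scores: (num_queries, num_classes) logits
    # mask_preds: (num_queries, h, w) mask logits
    # gt_labels: (num_gt,) class ids; gt_masks: (num_gt, h, w) in {0, 1}
    cls_cost = -cls_scores.softmax(-1)[:, gt_labels]      # (Q, G)
    pred = mask_preds.sigmoid().flatten(1)                # (Q, h*w)
    gt = gt_masks.flatten(1).float()                      # (G, h*w)
    mask_cost = torch.cdist(pred, gt, p=1) / gt.shape[1]  # (Q, G)
    inter = 2 * pred @ gt.T
    union = pred.sum(-1)[:, None] + gt.sum(-1)[None, :]
    dice_cost = 1 - (inter + 1) / (union + 1)             # (Q, G)
    cost = (cls_weight * cls_cost + mask_weight * mask_cost
            + dice_weight * dice_cost)
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return rows, cols  # query rows[i] is assigned to ground truth cols[i]
```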
76 changes: 37 additions & 39 deletions mmdet/models/dense_heads/maskformer_head.py
@@ -28,24 +28,24 @@ class MaskFormerHead(AnchorFreeHead):
num_things_classes (int): Number of things.
num_stuff_classes (int): Number of stuff.
num_queries (int): Number of queries in Transformer.
pixel_decoder (:obj:`mmcv.ConfigDict` | dict): Config for pixel
decoder. Defaults to None.
enforce_decoder_input_project (bool, optional): Whether to add a layer
to change the embed_dim of transformer encoder in pixel decoder to
the embed_dim of transformer decoder. Defaults to False.
transformer_decoder (:obj:`mmcv.ConfigDict` | dict): Config for
transformer decoder. Defaults to None.
positional_encoding (:obj:`mmcv.ConfigDict` | dict): Config for
transformer decoder position encoding. Defaults to None.
loss_cls (:obj:`mmcv.ConfigDict` | dict): Config of the classification
loss. Defaults to `CrossEntropyLoss`.
loss_mask (:obj:`mmcv.ConfigDict` | dict): Config of the mask loss.
Defaults to `FocalLoss`.
loss_dice (:obj:`mmcv.ConfigDict` | dict): Config of the dice loss.
Defaults to `DiceLoss`.
train_cfg (:obj:`mmcv.ConfigDict` | dict): Training config of
Maskformer head.
test_cfg (:obj:`mmcv.ConfigDict` | dict): Testing config of Maskformer
head.
init_cfg (dict or list[dict], optional): Initialization config dict.
Defaults to None.
@@ -177,12 +177,11 @@ def preprocess_gt(self, gt_labels_list, gt_masks_list, gt_semantic_segs):

Returns:
tuple: a tuple containing the following targets.
- labels (list[Tensor]): Ground truth class indices\
for all images. Each with shape (n, ), where n is the sum of\
the number of stuff types and the number of instances in an image.
- masks (list[Tensor]): Ground truth mask for each\
image, each with shape (n, h, w).
"""
num_things_list = [self.num_things_classes] * len(gt_labels_list)
num_stuff_list = [self.num_stuff_classes] * len(gt_labels_list)
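
In outline, this preprocessing turns per-instance "thing" annotations plus the per-pixel "stuff" semantic map into one (labels, masks) pair per image, as the docstring above describes. A rough sketch for a single image, with an assumed class-id layout (things first, then stuff); it is not the exact mmdet code:

```python
import torch

def naive_preprocess_gt(gt_labels, gt_masks, gt_semantic_seg,
                        num_things=80, num_stuff=53):
    # gt_labels: (num_instances,), gt_masks: (num_instances, h, w),
    # gt_semantic_seg: (h, w); stuff ids assumed in
    # [num_things, num_things + num_stuff)
    stuff_labels, stuff_masks = [], []
    for cls in torch.unique(gt_semantic_seg):
        if num_things <= int(cls) < num_things + num_stuff:
            stuff_labels.append(cls)
            stuff_masks.append((gt_semantic_seg == cls).float())
    if not stuff_labels:
        return gt_labels, gt_masks.float()
    labels = torch.cat([gt_labels, torch.stack(stuff_labels)])
    masks = torch.cat([gt_masks.float(), torch.stack(stuff_masks)])
    return labels, masks  # shapes (n, ) and (n, h, w), as documented above
```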
@@ -213,19 +212,18 @@ def get_targets(self, cls_scores_list, mask_preds_list, gt_labels_list,

Returns:
tuple[list[Tensor]]: a tuple containing the following targets.
- labels_list (list[Tensor]): Labels of all images.\
Each with shape (num_queries, ).
- label_weights_list (list[Tensor]): Label weights\
of all images. Each with shape (num_queries, ).
- mask_targets_list (list[Tensor]): Mask targets of\
all images. Each with shape (num_queries, h, w).
- mask_weights_list (list[Tensor]): Mask weights of\
all images. Each with shape (num_queries, ).
- num_total_pos (int): Number of positive samples in\
all images.
- num_total_neg (int): Number of negative samples in\
all images.
"""
(labels_list, label_weights_list, mask_targets_list, mask_weights_list,
pos_inds_list,
@@ -256,7 +254,6 @@ def _get_target_single(self, cls_score, mask_pred, gt_labels, gt_masks,

Returns:
tuple[Tensor]: a tuple containing the following for one image.
- labels (Tensor): Labels of each image.
shape (num_queries, ).
- label_weights (Tensor): Label weights of each image.
@@ -444,13 +441,14 @@ def forward(self, feats, img_metas):
img_metas (list[dict]): List of image information.

Returns:
tuple: a tuple containing two elements.
- all_cls_scores (Tensor): Classification scores for each\
scale level. Each is a 4D-tensor with shape\
(num_decoder, batch_size, num_queries, cls_out_channels).\
Note `cls_out_channels` should include background.
- all_mask_preds (Tensor): Mask scores for each decoder\
layer. Each with shape (num_decoder, batch_size,\
num_queries, h, w).
"""
batch_size = len(img_metas)
input_img_h, input_img_w = img_metas[0]['batch_input_shape']
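
The fragment above pulls `batch_input_shape` from the image metas; a typical use is to build a per-image padding mask (padded pixels masked out for the positional encoding and attention). A sketch following mmdet's meta-key conventions, not the verbatim implementation:

```python
import torch

def build_padding_mask(img_metas, device='cpu'):
    # `batch_input_shape` is the padded size shared by the batch;
    # `img_shape` is each image's unpadded (h, w, c) size in mmdet metas.
    batch_size = len(img_metas)
    input_h, input_w = img_metas[0]['batch_input_shape']
    padding_mask = torch.ones((batch_size, input_h, input_w), device=device)
    for i, meta in enumerate(img_metas):
        img_h, img_w, _ = meta['img_shape']
        padding_mask[i, :img_h, :img_w] = 0  # 0 = valid pixel, 1 = padding
    return padding_mask
```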
@@ -528,7 +526,7 @@ def forward_train(self,
ignored. Defaults to None.

Returns:
dict[str, Tensor]: a dictionary of loss components
"""
# not consider ignoring bboxes
assert gt_bboxes_ignore is None
@@ -607,8 +605,8 @@ def simple_test(self, feats, img_metas, rescale=False):
def post_process(self, mask_cls, mask_pred):
"""Panoptic segmengation inference.

This implementation is modified from `MaskFormer
<https://github.com/facebookresearch/MaskFormer>`_.

Args:
mask_cls (Tensor): Classification outputs for an image.
@@ -617,7 +615,7 @@ def post_process(self, mask_cls, mask_pred):
shape = (num_queries, h, w).

Returns:
Tensor: panoptic segmentation result of shape (h, w),\
each element in Tensor means:
segment_id = _cls + instance_id * INSTANCE_OFFSET.
"""
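
To illustrate the formula above, here is a simplified sketch of this style of panoptic inference, loosely adapted from the referenced MaskFormer post-processing. The thresholds and the `INSTANCE_OFFSET` value are assumptions; mmdet defines its own constants and handles more edge cases:

```python
import torch

INSTANCE_OFFSET = 1000  # assumed value; mmdet defines its own constant

def naive_panoptic_inference(mask_cls, mask_pred, num_things_classes=80,
                             object_mask_thr=0.8, overlap_thr=0.8):
    # mask_cls: (num_queries, num_classes + 1), last channel = "no object"
    # mask_pred: (num_queries, h, w) mask logits
    num_classes = mask_cls.shape[-1] - 1
    scores, labels = mask_cls.softmax(-1).max(-1)
    masks = mask_pred.sigmoid()
    keep = (labels != num_classes) & (scores > object_mask_thr)
    scores, labels, masks = scores[keep], labels[keep], masks[keep]

    h, w = masks.shape[-2:]
    panoptic_seg = torch.full((h, w), num_classes, dtype=torch.long)  # void fill
    if masks.shape[0] == 0:
        return panoptic_seg
    # each pixel goes to the query with the highest score-weighted mask prob
    mask_ids = (scores[:, None, None] * masks).argmax(0)
    instance_id = 1
    for q in range(masks.shape[0]):
        cls = int(labels[q])
        orig = masks[q] >= 0.5
        region = (mask_ids == q) & orig
        # drop fragments that lost most of their mask to other queries
        if orig.sum() == 0 or region.sum() / orig.sum() < overlap_thr:
            continue
        if cls < num_things_classes:  # "thing": unique id per instance
            panoptic_seg[region] = cls + instance_id * INSTANCE_OFFSET
            instance_id += 1
        else:                         # "stuff": one segment per class
            panoptic_seg[region] = cls
    return panoptic_seg
```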
2 changes: 1 addition & 1 deletion mmdet/models/detectors/maskformer.py
@@ -7,7 +7,7 @@
class MaskFormer(SingleStageDetector):
r"""Implementation of `Per-Pixel Classification is
NOT All You Need for Semantic Segmentation
<https://arxiv.org/pdf/2107.06278>`_"""
<https://arxiv.org/pdf/2107.06278>`_."""

def __init__(self,
backbone,
24 changes: 11 additions & 13 deletions mmdet/models/plugins/pixel_decoder.py
@@ -17,17 +17,17 @@ class PixelDecoder(BaseModule):
input feature maps.
feat_channels (int): Number of channels for features.
out_channels (int): Number of channels for output.
norm_cfg (:obj:`mmcv.ConfigDict` | dict): Config for normalization.
Defaults to dict(type='GN', num_groups=32).
act_cfg (:obj:`mmcv.ConfigDict` | dict): Config for activation.
Defaults to dict(type='ReLU').
encoder (:obj:`mmcv.ConfigDict` | dict): Config for transformer
encoder. Defaults to None.
positional_encoding (:obj:`mmcv.ConfigDict` | dict): Config for
transformer encoder position encoding. Defaults to
dict(type='SinePositionalEncoding', num_feats=128,
normalize=True).
init_cfg (:obj:`mmcv.ConfigDict` | dict): Initialization config dict.
Default: None
"""

@@ -95,10 +95,9 @@ def forward(self, feats, img_metas):

Returns:
tuple: a tuple containing the following:
- mask_feature (Tensor): Shape (batch_size, c, h, w).
- memory (Tensor): Output of last stage of backbone.\
Shape (batch_size, c, h, w).
"""
y = self.last_feat_conv(feats[-1])
for i in range(self.num_inputs - 2, -1, -1):
@@ -122,17 +121,17 @@ class TransformerEncoderPixelDecoder(PixelDecoder):
input feature maps.
feat_channels (int): Number of channels for features.
out_channels (int): Number of channels for output.
norm_cfg (:obj:`mmcv.ConfigDict` | dict): Config for normalization.
Defaults to dict(type='GN', num_groups=32).
act_cfg (:obj:`mmcv.ConfigDict` | dict): Config for activation.
Defaults to dict(type='ReLU').
encoder (:obj:`mmcv.ConfigDict` | dict): Config for transformer
encoder. Defaults to None.
positional_encoding (:obj:`mmcv.ConfigDict` | dict): Config for
transformer encoder position encoding. Defaults to
dict(type='SinePositionalEncoding', num_feats=128,
normalize=True).
init_cfg (:obj:`mmcv.ConfigDict` | dict): Initialization config dict.
Default: None
"""

@@ -200,7 +199,6 @@ def forward(self, feats, img_metas):

Returns:
tuple: a tuple containing the following:
- mask_feature (Tensor): shape (batch_size, c, h, w).
- memory (Tensor): shape (batch_size, c, h, w).
"""
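
To make the top-down loop in the first `forward` hunk concrete, here is a stripped-down, FPN-style sketch. Names like `last_feat_conv` mirror the fragment above, but the module is illustrative and not the mmdet class:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaivePixelDecoder(nn.Module):
    """Stripped-down sketch of PixelDecoder's top-down fusion (illustrative)."""

    def __init__(self, in_channels, feat_channels=256, out_channels=256):
        super().__init__()
        self.lateral_convs = nn.ModuleList(
            nn.Conv2d(c, feat_channels, kernel_size=1) for c in in_channels[:-1])
        self.output_convs = nn.ModuleList(
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1)
            for _ in in_channels[:-1])
        self.last_feat_conv = nn.Conv2d(in_channels[-1], feat_channels, 3, padding=1)
        self.mask_feature = nn.Conv2d(feat_channels, out_channels, 3, padding=1)

    def forward(self, feats):
        # start from the coarsest backbone level, as in the loop above
        y = self.last_feat_conv(feats[-1])
        for i in range(len(feats) - 2, -1, -1):
            lateral = self.lateral_convs[i](feats[i])
            y = lateral + F.interpolate(y, size=lateral.shape[-2:], mode='nearest')
            y = self.output_convs[i](y)
        # (mask_feature, memory): memory is the raw last backbone feature
        return self.mask_feature(y), feats[-1]

feats = [torch.rand(1, c, s, s)
         for c, s in [(256, 64), (512, 32), (1024, 16), (2048, 8)]]
mask_feature, memory = NaivePixelDecoder([256, 512, 1024, 2048])(feats)
```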
2 changes: 1 addition & 1 deletion tools/deployment/test.py
@@ -141,7 +141,7 @@ def main():


if __name__ == '__main__':
main()

# Following strings of text style are from colorama package
bright_style, reset_style = '\x1b[1m', '\x1b[0m'