update MaskFormer readme and docs #7241

Merged 9 commits on Feb 24, 2022.

Changes from all commits

50 changes: 21 additions & 29 deletions configs/maskformer/README.md
@@ -1,37 +1,18 @@
# MaskFormer

> [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278)

<!-- [ALGORITHM] -->

## Abstract

Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.

<div align=center>
<img src="https://camo.githubusercontent.com/29fb22298d506ce176caad3006a7b05ef2603ca12cece6c788b7e73c046e8bc9/68747470733a2f2f626f77656e63303232312e6769746875622e696f2f696d616765732f6d61736b666f726d65722e706e67" height="300"/>
</div>
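
To make the mask-classification idea concrete, here is a small sketch of the paper's semantic inference step, which marginalizes the per-query binary masks over the per-query class probabilities. All tensor names and sizes below are illustrative, not part of the mmdetection API.

```python
import torch

# Illustrative shapes only: 100 queries, 133 COCO panoptic classes.
num_queries, num_classes, h, w = 100, 133, 64, 64
mask_cls = torch.randn(num_queries, num_classes + 1)  # +1 "no object" slot
mask_pred = torch.randn(num_queries, h, w)            # mask logits

cls_prob = mask_cls.softmax(-1)[..., :-1]             # drop "no object"
mask_prob = mask_pred.sigmoid()
# weight every binary mask by its class probabilities, sum over queries
semseg = torch.einsum('qc,qhw->chw', cls_prob, mask_prob)
print(semseg.argmax(0).shape)                         # (h, w) class map
```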

## Introduction

MaskFormer requires the COCO and [COCO-panoptic](http://images.cocodataset.org/annotations/panoptic_annotations_trainval2017.zip) datasets for training and evaluation. You need to download and extract them in the COCO dataset path.
The directory structure should look like this.
@@ -55,6 +36,17 @@ mmdetection
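
As a convenience, the download-and-extract step can be scripted with the standard library alone. This is a minimal sketch; `data/coco` is a hypothetical dataset root, so adjust the paths to your own layout (the expected tree is in the collapsed section above).

```python
import urllib.request
import zipfile
from pathlib import Path

# `data/coco` is a hypothetical dataset root; change it to your layout.
data_root = Path('data/coco')
url = ('http://images.cocodataset.org/annotations/'
       'panoptic_annotations_trainval2017.zip')
archive = data_root / 'panoptic_annotations_trainval2017.zip'

data_root.mkdir(parents=True, exist_ok=True)
if not archive.exists():
    urllib.request.urlretrieve(url, archive)  # large download
with zipfile.ZipFile(archive) as zf:
    zf.extractall(data_root)                  # unpacks the panoptic annotations
```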

## Results and Models

| Backbone | style | Lr schd | Mem (GB) | Inf time (fps) | PQ | SQ | RQ | PQ_th | SQ_th | RQ_th | PQ_st | SQ_st | RQ_st | Config | Download | detail |
| :------: | :-----: | :-----: | :------: | :------------: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :------: | :----: |
| R-50 | pytorch | 75e | 16.6 | - | 46.854 | 80.617 | 57.085 | 51.089 | 81.511 | 61.853 | 40.463 | 79.269 | 49.888 | [config](https://download.openmmlab.com/mmdetection/v2.0/maskformer/maskformer_r50_mstrain_16x1_75e_coco/maskformer_r50_mstrain_16x1_75e_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/maskformer/maskformer_r50_mstrain_16x1_75e_coco/maskformer_r50_mstrain_16x1_75e_coco_20220221_141956-bc2699cb.pth) &#124; [log](https://download.openmmlab.com/mmdetection/v2.0/maskformer/maskformer_r50_mstrain_16x1_75e_coco/maskformer_r50_mstrain_16x1_75e_coco_20220221_141956.log.json) | This version was mentioned in Table XI of the paper [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) |
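
For a quick smoke test of the released checkpoint, something like the following should work on mmdetection v2.x; the config and checkpoint paths come from the table above, while `demo/demo.jpg` is the sample image shipped with the repo. Treat this as a sketch: the result layout for panoptic models differs from box detectors, so consult the docs of your mmdet version before indexing into `result`.

```python
from mmdet.apis import init_detector, inference_detector

config = 'configs/maskformer/maskformer_r50_mstrain_16x1_75e_coco.py'
checkpoint = ('https://download.openmmlab.com/mmdetection/v2.0/maskformer/'
              'maskformer_r50_mstrain_16x1_75e_coco/'
              'maskformer_r50_mstrain_16x1_75e_coco_20220221_141956-bc2699cb.pth')

model = init_detector(config, checkpoint, device='cuda:0')
result = inference_detector(model, 'demo/demo.jpg')
model.show_result('demo/demo.jpg', result, out_file='demo_panoptic.jpg')
```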

## Citation

```latex
@inproceedings{cheng2021maskformer,
title={Per-Pixel Classification is Not All You Need for Semantic Segmentation},
author={Bowen Cheng and Alexander G. Schwing and Alexander Kirillov},
booktitle={NeurIPS},
year={2021}
}
```
6 changes: 3 additions & 3 deletions mmdet/core/bbox/assigners/mask_hungarian_assigner.py
@@ -29,9 +29,9 @@ class MaskHungarianAssigner(BaseAssigner):
- positive integer: positive sample, index (1-based) of assigned gt

Args:
cls_cost (:obj:`mmcv.ConfigDict` | dict): Classification cost config.
mask_cost (:obj:`mmcv.ConfigDict` | dict): Mask cost config.
dice_cost (:obj:`mmcv.ConfigDict` | dict): Dice cost config.
"""

def __init__(self,
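
For intuition, the assignment this class performs boils down to building a (num_queries, num_gt) cost matrix from those three terms and solving it with the Hungarian algorithm. A simplified sketch, not the mmdet implementation (the L1 mask cost below is a stand-in for the configurable cost modules, and the default weights are assumptions):

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(cls_scores, mask_preds, gt_labels, gt_masks,
                    cls_weight=1.0, mask_weight=20.0, dice_weight=1.0):
    # cls_scores: (num_queries, num_classes) logits
    # mask_preds: (num_queries, h, w) mask logits
    # gt_labels: (num_gt,) class ids; gt_masks: (num_gt, h, w) in {0, 1}
    cls_cost = -cls_scores.softmax(-1)[:, gt_labels]      # (Q, G)
    pred = mask_preds.sigmoid().flatten(1)                # (Q, h*w)
    gt = gt_masks.flatten(1).float()                      # (G, h*w)
    mask_cost = torch.cdist(pred, gt, p=1) / gt.shape[1]  # (Q, G)
    inter = 2 * pred @ gt.T
    union = pred.sum(-1)[:, None] + gt.sum(-1)[None, :]
    dice_cost = 1 - (inter + 1) / (union + 1)             # (Q, G)
    cost = (cls_weight * cls_cost + mask_weight * mask_cost
            + dice_weight * dice_cost)
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return rows, cols  # query rows[i] is assigned to ground truth cols[i]
```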
76 changes: 37 additions & 39 deletions mmdet/models/dense_heads/maskformer_head.py
@@ -28,24 +28,24 @@ class MaskFormerHead(AnchorFreeHead):
num_things_classes (int): Number of things.
num_stuff_classes (int): Number of stuff.
num_queries (int): Number of queries in Transformer.
pixel_decoder (:obj:`mmcv.ConfigDict` | dict): Config for pixel
decoder. Defaults to None.
enforce_decoder_input_project (bool, optional): Whether to add a layer
to change the embed_dim of transformer encoder in pixel decoder to
the embed_dim of transformer decoder. Defaults to False.
transformer_decoder (:obj:`mmcv.ConfigDict` | dict): Config for
transformer decoder. Defaults to None.
positional_encoding (:obj:`mmcv.ConfigDict` | dict): Config for
transformer decoder position encoding. Defaults to None.
loss_cls (:obj:`mmcv.ConfigDict` | dict): Config of the classification
loss. Defaults to `CrossEntropyLoss`.
loss_mask (:obj:`mmcv.ConfigDict` | dict): Config of the mask loss.
Defaults to `FocalLoss`.
loss_dice (:obj:`mmcv.ConfigDict` | dict): Config of the dice loss.
Defaults to `DiceLoss`.
train_cfg (:obj:`mmcv.ConfigDict` | dict): Training config of
Maskformer head.
test_cfg (:obj:`mmcv.ConfigDict` | dict): Testing config of Maskformer
head.
init_cfg (dict or list[dict], optional): Initialization config dict.
Defaults to None.
@@ -177,12 +177,11 @@ def preprocess_gt(self, gt_labels_list, gt_masks_list, gt_semantic_segs):

Returns:
tuple: a tuple containing the following targets.
- labels (list[Tensor]): Ground truth class indices\
for all images. Each with shape (n, ), where n is the sum of\
the number of stuff types and the number of instances in an image.
- masks (list[Tensor]): Ground truth mask for each\
image, each with shape (n, h, w).
"""
num_things_list = [self.num_things_classes] * len(gt_labels_list)
num_stuff_list = [self.num_stuff_classes] * len(gt_labels_list)
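
In outline, this preprocessing turns per-instance "thing" annotations plus the per-pixel "stuff" semantic map into one (labels, masks) pair per image, as the docstring above describes. A rough sketch for a single image, with an assumed class-id layout (things first, then stuff); it is not the exact mmdet code:

```python
import torch

def naive_preprocess_gt(gt_labels, gt_masks, gt_semantic_seg,
                        num_things=80, num_stuff=53):
    # gt_labels: (num_instances,), gt_masks: (num_instances, h, w),
    # gt_semantic_seg: (h, w); stuff ids assumed in
    # [num_things, num_things + num_stuff)
    stuff_labels, stuff_masks = [], []
    for cls in torch.unique(gt_semantic_seg):
        if num_things <= int(cls) < num_things + num_stuff:
            stuff_labels.append(cls)
            stuff_masks.append((gt_semantic_seg == cls).float())
    if not stuff_labels:
        return gt_labels, gt_masks.float()
    labels = torch.cat([gt_labels, torch.stack(stuff_labels)])
    masks = torch.cat([gt_masks.float(), torch.stack(stuff_masks)])
    return labels, masks  # shapes (n, ) and (n, h, w), as documented above
```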
@@ -213,19 +212,18 @@ def get_targets(self, cls_scores_list, mask_preds_list, gt_labels_list,

Returns:
tuple[list[Tensor]]: a tuple containing the following targets.
- labels_list (list[Tensor]): Labels of all images.\
Each with shape (num_queries, ).
- label_weights_list (list[Tensor]): Label weights\
of all images. Each with shape (num_queries, ).
- mask_targets_list (list[Tensor]): Mask targets of\
all images. Each with shape (num_queries, h, w).
- mask_weights_list (list[Tensor]): Mask weights of\
all images. Each with shape (num_queries, ).
- num_total_pos (int): Number of positive samples in\
all images.
- num_total_neg (int): Number of negative samples in\
all images.
"""
(labels_list, label_weights_list, mask_targets_list, mask_weights_list,
pos_inds_list,
@@ -256,7 +254,6 @@ def _get_target_single(self, cls_score, mask_pred, gt_labels, gt_masks,

Returns:
tuple[Tensor]: a tuple containing the following for one image.
- labels (Tensor): Labels of each image.
shape (num_queries, ).
- label_weights (Tensor): Label weights of each image.
@@ -444,13 +441,14 @@ def forward(self, feats, img_metas):
img_metas (list[dict]): List of image information.

Returns:
tuple: a tuple containing two elements.
- all_cls_scores (Tensor): Classification scores for each\
scale level. Each is a 4D-tensor with shape\
(num_decoder, batch_size, num_queries, cls_out_channels).\
Note `cls_out_channels` should include background.
- all_mask_preds (Tensor): Mask scores for each decoder\
layer. Each with shape (num_decoder, batch_size,\
num_queries, h, w).
"""
batch_size = len(img_metas)
input_img_h, input_img_w = img_metas[0]['batch_input_shape']
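
The fragment above pulls `batch_input_shape` from the image metas; a typical use is to build a per-image padding mask (padded pixels masked out for the positional encoding and attention). A sketch following mmdet's meta-key conventions, not the verbatim implementation:

```python
import torch

def build_padding_mask(img_metas, device='cpu'):
    # `batch_input_shape` is the padded size shared by the batch;
    # `img_shape` is each image's unpadded (h, w, c) size in mmdet metas.
    batch_size = len(img_metas)
    input_h, input_w = img_metas[0]['batch_input_shape']
    padding_mask = torch.ones((batch_size, input_h, input_w), device=device)
    for i, meta in enumerate(img_metas):
        img_h, img_w, _ = meta['img_shape']
        padding_mask[i, :img_h, :img_w] = 0  # 0 = valid pixel, 1 = padding
    return padding_mask
```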
@@ -528,7 +526,7 @@ def forward_train(self,
ignored. Defaults to None.

Returns:
dict[str, Tensor]: a dictionary of loss components
"""
# not consider ignoring bboxes
assert gt_bboxes_ignore is None
@@ -607,8 +605,8 @@ def simple_test(self, feats, img_metas, rescale=False):
def post_process(self, mask_cls, mask_pred):
"""Panoptic segmengation inference.

This implementation is modified from `MaskFormer
<https://github.com/facebookresearch/MaskFormer>`_.

Args:
mask_cls (Tensor): Classification outputs for an image.
@@ -617,7 +615,7 @@ def post_process(self, mask_cls, mask_pred):
shape = (num_queries, h, w).

Returns:
Tensor: panoptic segmentation result of shape (h, w),\
each element in Tensor means:
segment_id = _cls + instance_id * INSTANCE_OFFSET.
"""
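
To illustrate the formula above, here is a simplified sketch of this style of panoptic inference, loosely adapted from the referenced MaskFormer post-processing. The thresholds and the `INSTANCE_OFFSET` value are assumptions; mmdet defines its own constants and handles more edge cases:

```python
import torch

INSTANCE_OFFSET = 1000  # assumed value; mmdet defines its own constant

def naive_panoptic_inference(mask_cls, mask_pred, num_things_classes=80,
                             object_mask_thr=0.8, overlap_thr=0.8):
    # mask_cls: (num_queries, num_classes + 1), last channel = "no object"
    # mask_pred: (num_queries, h, w) mask logits
    num_classes = mask_cls.shape[-1] - 1
    scores, labels = mask_cls.softmax(-1).max(-1)
    masks = mask_pred.sigmoid()
    keep = (labels != num_classes) & (scores > object_mask_thr)
    scores, labels, masks = scores[keep], labels[keep], masks[keep]

    h, w = masks.shape[-2:]
    panoptic_seg = torch.full((h, w), num_classes, dtype=torch.long)  # void fill
    if masks.shape[0] == 0:
        return panoptic_seg
    # each pixel goes to the query with the highest score-weighted mask prob
    mask_ids = (scores[:, None, None] * masks).argmax(0)
    instance_id = 1
    for q in range(masks.shape[0]):
        cls = int(labels[q])
        orig = masks[q] >= 0.5
        region = (mask_ids == q) & orig
        # drop fragments that lost most of their mask to other queries
        if orig.sum() == 0 or region.sum() / orig.sum() < overlap_thr:
            continue
        if cls < num_things_classes:  # "thing": unique id per instance
            panoptic_seg[region] = cls + instance_id * INSTANCE_OFFSET
            instance_id += 1
        else:                         # "stuff": one segment per class
            panoptic_seg[region] = cls
    return panoptic_seg
```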
2 changes: 1 addition & 1 deletion mmdet/models/detectors/maskformer.py
@@ -7,7 +7,7 @@
class MaskFormer(SingleStageDetector):
r"""Implementation of `Per-Pixel Classification is
NOT All You Need for Semantic Segmentation
<https://arxiv.org/pdf/2107.06278>`_"""
<https://arxiv.org/pdf/2107.06278>`_."""

def __init__(self,
backbone,
24 changes: 11 additions & 13 deletions mmdet/models/plugins/pixel_decoder.py
@@ -17,17 +17,17 @@ class PixelDecoder(BaseModule):
input feature maps.
feat_channels (int): Number of channels for features.
out_channels (int): Number of channels for output.
norm_cfg (:obj:`mmcv.ConfigDict` | dict): Config for normalization.
Defaults to dict(type='GN', num_groups=32).
act_cfg (:obj:`mmcv.ConfigDict` | dict): Config for activation.
Defaults to dict(type='ReLU').
encoder (:obj:`mmcv.ConfigDict` | dict): Config for transformer
encoder. Defaults to None.
positional_encoding (:obj:`mmcv.ConfigDict` | dict): Config for
transformer encoder position encoding. Defaults to
dict(type='SinePositionalEncoding', num_feats=128,
normalize=True).
init_cfg (:obj:`mmcv.ConfigDict` | dict): Initialization config dict.
Default: None
"""

@@ -95,10 +95,9 @@ def forward(self, feats, img_metas):

Returns:
tuple: a tuple containing the following:
- mask_feature (Tensor): Shape (batch_size, c, h, w).
- memory (Tensor): Output of last stage of backbone.\
Shape (batch_size, c, h, w).
"""
y = self.last_feat_conv(feats[-1])
for i in range(self.num_inputs - 2, -1, -1):
@@ -122,17 +121,17 @@ class TransformerEncoderPixelDecoder(PixelDecoder):
input feature maps.
feat_channels (int): Number of channels for features.
out_channels (int): Number of channels for output.
norm_cfg (:obj:`mmcv.ConfigDict` | dict): Config for normalization.
Defaults to dict(type='GN', num_groups=32).
act_cfg (:obj:`mmcv.ConfigDict` | dict): Config for activation.
Defaults to dict(type='ReLU').
encoder (:obj:`mmcv.ConfigDict` | dict): Config for transformer
encoder. Defaults to None.
positional_encoding (:obj:`mmcv.ConfigDict` | dict): Config for
transformer encoder position encoding. Defaults to
dict(type='SinePositionalEncoding', num_feats=128,
normalize=True).
init_cfg (:obj:`mmcv.ConfigDict` | dict): Initialization config dict.
Default: None
"""

@@ -200,7 +199,6 @@ def forward(self, feats, img_metas):

Returns:
tuple: a tuple containing the following:
- mask_feature (Tensor): shape (batch_size, c, h, w).
- memory (Tensor): shape (batch_size, c, h, w).
"""
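
To make the top-down loop in the first `forward` hunk concrete, here is a stripped-down, FPN-style sketch. Names like `last_feat_conv` mirror the fragment above, but the module is illustrative and not the mmdet class:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaivePixelDecoder(nn.Module):
    """Stripped-down sketch of PixelDecoder's top-down fusion (illustrative)."""

    def __init__(self, in_channels, feat_channels=256, out_channels=256):
        super().__init__()
        self.lateral_convs = nn.ModuleList(
            nn.Conv2d(c, feat_channels, kernel_size=1) for c in in_channels[:-1])
        self.output_convs = nn.ModuleList(
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1)
            for _ in in_channels[:-1])
        self.last_feat_conv = nn.Conv2d(in_channels[-1], feat_channels, 3, padding=1)
        self.mask_feature = nn.Conv2d(feat_channels, out_channels, 3, padding=1)

    def forward(self, feats):
        # start from the coarsest backbone level, as in the loop above
        y = self.last_feat_conv(feats[-1])
        for i in range(len(feats) - 2, -1, -1):
            lateral = self.lateral_convs[i](feats[i])
            y = lateral + F.interpolate(y, size=lateral.shape[-2:], mode='nearest')
            y = self.output_convs[i](y)
        # (mask_feature, memory): memory is the raw last backbone feature
        return self.mask_feature(y), feats[-1]

feats = [torch.rand(1, c, s, s)
         for c, s in [(256, 64), (512, 32), (1024, 16), (2048, 8)]]
mask_feature, memory = NaivePixelDecoder([256, 512, 1024, 2048])(feats)
```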
2 changes: 1 addition & 1 deletion tools/deployment/test.py
@@ -141,7 +141,7 @@ def main():


if __name__ == '__main__':
main()

# Following strings of text style are from colorama package
bright_style, reset_style = '\x1b[1m', '\x1b[0m'