add tcformer (#1447)
zengwang430521 authored Jun 28, 2022
1 parent ca30db7 commit 1529e94
Showing 14 changed files with 2,000 additions and 4 deletions.
@@ -0,0 +1,38 @@
<!-- [ALGORITHM] -->

<details>
<summary align="right"><a href="https://openaccess.thecvf.com/content/CVPR2022/html/Zeng_Not_All_Tokens_Are_Equal_Human-Centric_Visual_Analysis_via_Token_CVPR_2022_paper.html">TCFormer (CVPR'2022)</a></summary>

```bibtex
@inproceedings{zeng2022not,
title={Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer},
author={Zeng, Wang and Jin, Sheng and Liu, Wentao and Qian, Chen and Luo, Ping and Ouyang, Wanli and Wang, Xiaogang},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={11101--11111},
year={2022}
}
```

</details>

<!-- [DATASET] -->

<details>
<summary align="right"><a href="https://link.springer.com/chapter/10.1007/978-3-030-58545-7_12">COCO-WholeBody (ECCV'2020)</a></summary>

```bibtex
@inproceedings{jin2020whole,
title={Whole-Body Human Pose Estimation in the Wild},
author={Jin, Sheng and Xu, Lumin and Xu, Jin and Wang, Can and Liu, Wentao and Qian, Chen and Ouyang, Wanli and Luo, Ping},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
year={2020}
}
```

</details>

Results on the COCO-WholeBody v1.0 val set, using a human detector with 56.4 AP on the COCO val2017 dataset

| Arch | Input Size | Body AP | Body AR | Foot AP | Foot AR | Face AP | Face AR | Hand AP | Hand AR | Whole AP | Whole AR | ckpt | log |
| :-------------------------------------- | :--------: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :------: | :------: | :--------------------------------------: | :-------------------------------------: |
| [tcformer](/configs/wholebody/2d_kpt_sview_rgb_img/topdown_heatmap/coco-wholebody/tcformer_coco_wholebody_256x192.py) | 256x192 | 0.691 | 0.769 | 0.690 | 0.809 | 0.650 | 0.747 | 0.534 | 0.647 | 0.574 | 0.678 | [ckpt](https://download.openmmlab.com/mmpose/top_down/tcformer/tcformer_coco-wholebody_256x192-a0720efa_20220627.pth) | [log](https://download.openmmlab.com/mmpose/top_down/tcformer/tcformer_coco-wholebody_256x192_20220627.log.json) |
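
The snippet below is a minimal inference sketch for the checkpoint above, assuming the mmpose 0.x top-down API (`init_pose_model` / `inference_top_down_pose_model`); the image path and person bounding box are placeholders.

```python
# Hedged sketch: load the TCFormer whole-body model and run top-down inference.
from mmpose.apis import inference_top_down_pose_model, init_pose_model

config_file = ('configs/wholebody/2d_kpt_sview_rgb_img/topdown_heatmap/'
               'coco-wholebody/tcformer_coco_wholebody_256x192.py')
checkpoint_file = ('https://download.openmmlab.com/mmpose/top_down/tcformer/'
                   'tcformer_coco-wholebody_256x192-a0720efa_20220627.pth')

model = init_pose_model(config_file, checkpoint_file, device='cpu')

# person_results would normally come from a detector; this xywh box is made up.
person_results = [{'bbox': [50, 50, 200, 400]}]
pose_results, _ = inference_top_down_pose_model(
    model,
    'demo.jpg',  # placeholder image path
    person_results,
    format='xywh',
    dataset='TopDownCocoWholeBodyDataset')
print(pose_results[0]['keypoints'].shape)  # expect (133, 3): x, y, score
```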
@@ -0,0 +1,30 @@
Collections:
- Name: TCFormer
  Paper:
    Title: 'Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering
      Transformer'
    URL: https://openaccess.thecvf.com/content/CVPR2022/html/Zeng_Not_All_Tokens_Are_Equal_Human-Centric_Visual_Analysis_via_Token_CVPR_2022_paper.html
  README: https://github.com/open-mmlab/mmpose/blob/master/docs/en/papers/backbones/tcformer.md
Models:
- Config: configs/wholebody/2d_kpt_sview_rgb_img/topdown_heatmap/coco-wholebody/tcformer_coco_wholebody_256x192.py
  In Collection: TCFormer
  Metadata:
    Architecture:
    - TCFormer
    Training Data: COCO-WholeBody
  Name: topdown_heatmap_tcformer_coco_wholebody_256x192
  Results:
  - Dataset: COCO-WholeBody
    Metrics:
      Body AP: 0.691
      Body AR: 0.769
      Face AP: 0.65
      Face AR: 0.747
      Foot AP: 0.69
      Foot AR: 0.809
      Hand AP: 0.534
      Hand AR: 0.647
      Whole AP: 0.574
      Whole AR: 0.678
    Task: Wholebody 2D Keypoint
  Weights: https://download.openmmlab.com/mmpose/top_down/tcformer/tcformer_coco-wholebody_256x192-a0720efa_20220627.pth
@@ -0,0 +1,171 @@
_base_ = ['../../../../_base_/datasets/coco_wholebody.py']
log_level = 'INFO'
load_from = None
resume_from = None
dist_params = dict(backend='nccl')
workflow = [('train', 1)]
checkpoint_config = dict(interval=10)
evaluation = dict(interval=10, metric='mAP', save_best='AP')

optimizer = dict(
    type='AdamW',
    lr=5e-4,
    betas=(0.9, 0.999),
    weight_decay=0.01,
)

optimizer_config = dict(grad_clip=None)
# learning policy
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[170, 200])
total_epochs = 210
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
        # dict(type='TensorboardLoggerHook')
    ])

channel_cfg = dict(
    num_output_channels=133,
    dataset_joints=133,
    dataset_channel=[
        list(range(133)),
    ],
    inference_channel=list(range(133)))

# model settings
norm_cfg = dict(type='SyncBN', requires_grad=True)
model = dict(
    type='TopDown',
    pretrained='https://download.openmmlab.com/mmpose/'
    'pretrain_models/tcformer-4e1adbf1_20220421.pth',
    backbone=dict(
        type='TCFormer',
        embed_dims=[64, 128, 320, 512],
        num_heads=[1, 2, 5, 8],
        mlp_ratios=[8, 8, 4, 4],
        qkv_bias=True,
        num_layers=[3, 4, 6, 3],
        sr_ratios=[8, 4, 2, 1],
        drop_path_rate=0.1),
    neck=dict(
        type='MTA',
        in_channels=[64, 128, 320, 512],
        out_channels=256,
        start_level=0,
        num_heads=[4, 4, 4, 4],
        mlp_ratios=[4, 4, 4, 4],
        num_outs=4,
        use_sr_conv=False,
    ),
    keypoint_head=dict(
        type='TopdownHeatmapSimpleHead',
        in_channels=256,
        out_channels=channel_cfg['num_output_channels'],
        num_deconv_layers=0,
        extra=dict(final_conv_kernel=1, ),
        loss_keypoint=dict(type='JointsMSELoss', use_target_weight=True)),
    train_cfg=dict(),
    test_cfg=dict(
        flip_test=True,
        post_process='default',
        shift_heatmap=True,
        modulate_kernel=11))

data_cfg = dict(
    image_size=[192, 256],
    heatmap_size=[48, 64],
    num_output_channels=channel_cfg['num_output_channels'],
    num_joints=channel_cfg['dataset_joints'],
    dataset_channel=channel_cfg['dataset_channel'],
    inference_channel=channel_cfg['inference_channel'],
    soft_nms=False,
    nms_thr=1.0,
    oks_thr=0.9,
    vis_thr=0.2,
    use_gt_bbox=False,
    det_bbox_thr=0.0,
    bbox_file='data/coco/person_detection_results/'
    'COCO_val2017_detections_AP_H_56_person.json',
)

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='TopDownGetBboxCenterScale', padding=1.25),
    dict(type='TopDownRandomShiftBboxCenter', shift_factor=0.16, prob=0.3),
    dict(type='TopDownRandomFlip', flip_prob=0.5),
    dict(
        type='TopDownHalfBodyTransform',
        num_joints_half_body=8,
        prob_half_body=0.3),
    dict(
        type='TopDownGetRandomScaleRotation', rot_factor=40, scale_factor=0.5),
    dict(type='TopDownAffine'),
    dict(type='ToTensor'),
    dict(
        type='NormalizeTensor',
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]),
    dict(type='TopDownGenerateTarget', sigma=2),
    dict(
        type='Collect',
        keys=['img', 'target', 'target_weight'],
        meta_keys=[
            'image_file', 'joints_3d', 'joints_3d_visible', 'center', 'scale',
            'rotation', 'bbox_score', 'flip_pairs'
        ]),
]

val_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='TopDownGetBboxCenterScale', padding=1.25),
    dict(type='TopDownAffine'),
    dict(type='ToTensor'),
    dict(
        type='NormalizeTensor',
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]),
    dict(
        type='Collect',
        keys=['img'],
        meta_keys=[
            'image_file', 'center', 'scale', 'rotation', 'bbox_score',
            'flip_pairs'
        ]),
]

test_pipeline = val_pipeline

data_root = 'data/coco'
data = dict(
    samples_per_gpu=64,
    workers_per_gpu=2,
    val_dataloader=dict(samples_per_gpu=32),
    test_dataloader=dict(samples_per_gpu=32),
    train=dict(
        type='TopDownCocoWholeBodyDataset',
        ann_file=f'{data_root}/annotations/coco_wholebody_train_v1.0.json',
        img_prefix=f'{data_root}/train2017/',
        data_cfg=data_cfg,
        pipeline=train_pipeline,
        dataset_info={{_base_.dataset_info}}),
    val=dict(
        type='TopDownCocoWholeBodyDataset',
        ann_file=f'{data_root}/annotations/coco_wholebody_val_v1.0.json',
        img_prefix=f'{data_root}/val2017/',
        data_cfg=data_cfg,
        pipeline=val_pipeline,
        dataset_info={{_base_.dataset_info}}),
    test=dict(
        type='TopDownCocoWholeBodyDataset',
        ann_file=f'{data_root}/annotations/coco_wholebody_val_v1.0.json',
        img_prefix=f'{data_root}/val2017/',
        data_cfg=data_cfg,
        pipeline=val_pipeline,
        dataset_info={{_base_.dataset_info}}),
)
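
As a quick sanity check, the model described by this config can be built and run on a dummy crop. A sketch, assuming a local mmpose checkout with this commit applied (`forward_dummy` is the light-weight forward mmpose 0.x uses for FLOPs counting):

```python
# Hedged sketch: build the TCFormer top-down model from this config file.
import torch
from mmcv import Config
from mmpose.models import build_posenet

cfg = Config.fromfile(
    'configs/wholebody/2d_kpt_sview_rgb_img/topdown_heatmap/'
    'coco-wholebody/tcformer_coco_wholebody_256x192.py')
cfg.model.pretrained = None  # skip downloading the pretrained backbone weights
model = build_posenet(cfg.model)
model.eval()

# One 3x256x192 crop in; the head predicts 133 heatmaps at 1/4 resolution.
with torch.no_grad():
    heatmaps = model.forward_dummy(torch.randn(1, 3, 256, 192))
print(heatmaps.shape)  # expected: torch.Size([1, 133, 64, 48])
```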
53 changes: 53 additions & 0 deletions docs/en/papers/backbones/tcformer.md
@@ -0,0 +1,53 @@
# Not All Tokens Are Equal: Human-Centric Visual Analysis via Token Clustering Transformer

<!-- [ALGORITHM] -->

<details>
<summary align="right"><a href="https://openaccess.thecvf.com/content/CVPR2022/html/Zeng_Not_All_Tokens_Are_Equal_Human-Centric_Visual_Analysis_via_Token_CVPR_2022_paper.html">TCFormer (CVPR'2022)</a></summary>

```bibtex
@inproceedings{zeng2022not,
title={Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer},
author={Zeng, Wang and Jin, Sheng and Liu, Wentao and Qian, Chen and Luo, Ping and Ouyang, Wanli and Wang, Xiaogang},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={11101--11111},
year={2022}
}
```

</details>

## Abstract

<!-- [ABSTRACT] -->

Vision transformers have achieved great successes in many computer vision tasks. Most methods generate vision tokens by splitting an image into a regular and fixed grid and treating each cell as a token. However, not all regions are equally important in human-centric vision tasks, e.g., the human body needs a fine representation with many tokens, while the image background can be modeled by a few tokens. To address this problem, we propose a novel Vision Transformer, called Token Clustering Transformer (TCFormer), which merges tokens by progressive clustering, where the tokens can be merged from different locations with flexible shapes and sizes. The tokens in TCFormer can not only focus on important areas but also adjust the token shapes to fit the semantic concept and adopt a fine resolution for regions containing critical details, which is beneficial to capturing detailed information. Extensive experiments show that TCFormer consistently outperforms its counterparts on different challenging human-centric tasks and datasets, including whole-body pose estimation on COCO-WholeBody and 3D human mesh reconstruction on 3DPW. Code is available at https://github.com/zengwang430521/TCFormer.git.
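
The key idea is that token boundaries come from clustering in feature space rather than from a fixed grid. Below is an illustrative PyTorch sketch of a density-peaks-style (DPC-kNN) token merging step in the spirit of the paper; it is a simplification, not the CTM module shipped in this commit, and all names are ours.

```python
import torch


def merge_tokens(x, num_clusters, k=5):
    """Simplified DPC-kNN-style token merging (illustrative only).

    x: (B, N, C) token features. Returns (B, num_clusters, C).
    """
    B, N, C = x.shape
    dist = torch.cdist(x, x)  # (B, N, N) pairwise feature distances

    # Local density: high when the k nearest neighbours are close by.
    knn_dist, _ = dist.topk(k, dim=-1, largest=False)
    density = (-knn_dist.pow(2).mean(dim=-1)).exp()  # (B, N)

    # Distance to the nearest token of higher density ("delta" in DPC).
    higher = density[:, None, :] > density[:, :, None]  # (B, N, N)
    delta = dist.masked_fill(~higher, float('inf')).min(dim=-1).values
    max_dist = dist.flatten(1).max(dim=-1, keepdim=True).values  # (B, 1)
    delta = torch.where(torch.isinf(delta), max_dist.expand_as(delta), delta)

    # Cluster centres: tokens that are both dense and far from denser ones.
    centre_idx = (density * delta).topk(num_clusters, dim=-1).indices  # (B, K)
    centres = torch.gather(x, 1, centre_idx[..., None].expand(-1, -1, C))

    # Assign every token to its nearest centre and average the features.
    assign = torch.cdist(x, centres).argmin(dim=-1)  # (B, N)
    merged = torch.zeros(B, num_clusters, C, device=x.device)
    counts = torch.zeros(B, num_clusters, 1, device=x.device)
    merged.scatter_add_(1, assign[..., None].expand(-1, -1, C), x)
    counts.scatter_add_(1, assign[..., None],
                        torch.ones(B, N, 1, device=x.device))
    return merged / counts.clamp(min=1)


tokens = torch.randn(2, 196, 64)       # e.g. a 14x14 grid of 64-d tokens
print(merge_tokens(tokens, 49).shape)  # torch.Size([2, 49, 64])
```

Applying such merging progressively across stages reduces the token count while letting clusters take free-form shapes, so resolution is spent on the person rather than the background; the module in this commit is more involved (e.g. learned token importance weighting).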

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/28900607/175868010-b408e0dc-768c-4fb9-9095-5874fcb42d7f.png">
</div>
3 changes: 2 additions & 1 deletion mmpose/models/backbones/__init__.py
@@ -22,6 +22,7 @@
from .shufflenet_v1 import ShuffleNetV1
from .shufflenet_v2 import ShuffleNetV2
from .swin import SwinTransformer
from .tcformer import TCFormer
from .tcn import TCN
from .v2v_net import V2VNet
from .vgg import VGG
@@ -34,5 +35,5 @@
    'SEResNet', 'SEResNeXt', 'ShuffleNetV1', 'ShuffleNetV2', 'CPM', 'RSN',
    'MSPN', 'ResNeSt', 'VGG', 'TCN', 'ViPNAS_ResNet', 'ViPNAS_MobileNetV3',
    'LiteHRNet', 'V2VNet', 'HRFormer', 'PyramidVisionTransformer',
    'PyramidVisionTransformerV2', 'SwinTransformer', 'I3D', 'TCFormer'
]
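
With the registration above, the new backbone can be built from its registry name. A quick smoke-test sketch using the hyper-parameters from the config earlier in this commit; the structure of the outputs is an assumption, so inspect `tcformer.py` rather than relying on particular tensor shapes:

```python
# Hedged sketch: instantiate the newly registered backbone via the registry.
import torch
from mmpose.models import build_backbone

backbone = build_backbone(
    dict(
        type='TCFormer',
        embed_dims=[64, 128, 320, 512],
        num_heads=[1, 2, 5, 8],
        mlp_ratios=[8, 8, 4, 4],
        qkv_bias=True,
        num_layers=[3, 4, 6, 3],
        sr_ratios=[8, 4, 2, 1],
        drop_path_rate=0.1))
backbone.init_weights()
backbone.eval()

with torch.no_grad():
    outs = backbone(torch.randn(1, 3, 256, 192))
# TCFormer-style backbones typically return multi-scale token/feature
# representations (one entry per stage) rather than a single tensor.
print(type(outs), len(outs))
```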