1 parent ca30db7 commit 1529e94
Showing 14 changed files with 2,000 additions and 4 deletions.
38 changes: 38 additions & 0 deletions
.../2d_kpt_sview_rgb_img/topdown_heatmap/coco-wholebody/tcformer_coco-wholebody.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
<!-- [ALGORITHM] --> | ||
|
||
<details> | ||
<summary align="right"><a href="https://openaccess.thecvf.com/content/CVPR2022/html/Zeng_Not_All_Tokens_Are_Equal_Human-Centric_Visual_Analysis_via_Token_CVPR_2022_paper.html">TCFormer (CVPR'2022)</a></summary> | ||
|
||
```bibtex | ||
@inproceedings{zeng2022not, | ||
title={Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer}, | ||
author={Zeng, Wang and Jin, Sheng and Liu, Wentao and Qian, Chen and Luo, Ping and Ouyang, Wanli and Wang, Xiaogang}, | ||
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, | ||
pages={11101--11111}, | ||
year={2022} | ||
} | ||
``` | ||
|
||
</details> | ||
|
||
<!-- [DATASET] --> | ||
|
||
<details> | ||
<summary align="right"><a href="https://link.springer.com/chapter/10.1007/978-3-030-58545-7_12">COCO-WholeBody (ECCV'2020)</a></summary> | ||
|
||
```bibtex | ||
@inproceedings{jin2020whole, | ||
title={Whole-Body Human Pose Estimation in the Wild}, | ||
author={Jin, Sheng and Xu, Lumin and Xu, Jin and Wang, Can and Liu, Wentao and Qian, Chen and Ouyang, Wanli and Luo, Ping}, | ||
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)}, | ||
year={2020} | ||
} | ||
``` | ||
|
||
</details> | ||
|
||
Results on COCO-WholeBody v1.0 val with detector having human AP of 56.4 on COCO val2017 dataset | ||
|
||
| Arch | Input Size | Body AP | Body AR | Foot AP | Foot AR | Face AP | Face AR | Hand AP | Hand AR | Whole AP | Whole AR | ckpt | log | | ||
| :-------------------------------------- | :--------: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :------: | :------: | :--------------------------------------: | :-------------------------------------: | | ||
| [tcformer](/configs/wholebody/2d_kpt_sview_rgb_img/topdown_heatmap/coco-wholebody/tcformer_coco_wholebody_256x192.py) | 256x192 | 0.691 | 0.769 | 0.690 | 0.809 | 0.650 | 0.747 | 0.534 | 0.647 | 0.574 | 0.678 | [ckpt](https://download.openmmlab.com/mmpose/top_down/tcformer/tcformer_coco-wholebody_256x192-a0720efa_20220627.pth) | [log](https://download.openmmlab.com/mmpose/top_down/tcformer/tcformer_coco-wholebody_256x192_20220627.log.json) | |
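
The table reports AP and AR separately for body, foot, face, hand, and whole-body keypoints. As a rough illustration of how the 133 COCO-WholeBody keypoints are assumed to split into those parts, the sketch below derives the index range of each part. The 17/6/68/21/21 split and its ordering are assumptions about the dataset's annotation layout, and `PART_SIZES` and `part_slices` are hypothetical names that are not part of MMPose.

```python
from itertools import accumulate

# Assumed keypoint counts per part, in COCO-WholeBody annotation order.
PART_SIZES = {
    "body": 17,
    "foot": 6,
    "face": 68,
    "left_hand": 21,
    "right_hand": 21,
}


def part_slices(part_sizes):
    """Map each part name to the slice of keypoint indices it occupies."""
    ends = list(accumulate(part_sizes.values()))
    starts = [0] + ends[:-1]
    return {name: slice(s, e) for name, s, e in zip(part_sizes, starts, ends)}


slices = part_slices(PART_SIZES)
total = sum(PART_SIZES.values())
for name, sl in slices.items():
    print(f"{name:>10}: keypoints {sl.start}..{sl.stop - 1}")
print("total keypoints:", total)
```

The "Hand" columns would then cover both hand ranges together, and the "Whole" columns all 133 keypoints.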
30 changes: 30 additions & 0 deletions
...wholebody/2d_kpt_sview_rgb_img/topdown_heatmap/coco-wholebody/tcformer_coco-wholebody.yml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
Collections: | ||
- Name: TCFormer | ||
  Paper: | ||
    Title: 'Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering | ||
      Transformer' | ||
    URL: https://openaccess.thecvf.com/content/CVPR2022/html/Zeng_Not_All_Tokens_Are_Equal_Human-Centric_Visual_Analysis_via_Token_CVPR_2022_paper.html | ||
  README: https://github.com/open-mmlab/mmpose/blob/master/docs/en/papers/backbones/tcformer.md | ||
Models: | ||
- Config: configs/wholebody/2d_kpt_sview_rgb_img/topdown_heatmap/coco-wholebody/tcformer_coco_wholebody_256x192.py | ||
  In Collection: TCFormer | ||
  Metadata: | ||
    Architecture: | ||
    - TCFormer | ||
    Training Data: COCO-WholeBody | ||
  Name: topdown_heatmap_tcformer_coco_wholebody_256x192 | ||
  Results: | ||
  - Dataset: COCO-WholeBody | ||
    Metrics: | ||
      Body AP: 0.691 | ||
      Body AR: 0.769 | ||
      Face AP: 0.65 | ||
      Face AR: 0.747 | ||
      Foot AP: 0.69 | ||
      Foot AR: 0.809 | ||
      Hand AP: 0.534 | ||
      Hand AR: 0.647 | ||
      Whole AP: 0.574 | ||
      Whole AR: 0.678 | ||
    Task: Wholebody 2D Keypoint | ||
  Weights: https://download.openmmlab.com/mmpose/top_down/tcformer/tcformer_coco-wholebody_256x192-a0720efa_20220627.pth |
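
The metafile repeats the numbers from the results table in the accompanying Markdown file, so the two can drift apart when one is edited. The sketch below shows one way to cross-check them. The table row and metric values are copied by hand from this commit, and `parse_row` and `find_mismatches` are hypothetical helpers rather than existing MMPose tooling.

```python
# Metric columns in the order they appear in the results table header.
COLUMNS = ["Body AP", "Body AR", "Foot AP", "Foot AR", "Face AP", "Face AR",
           "Hand AP", "Hand AR", "Whole AP", "Whole AR"]

# Input size and numeric cells copied from the tcformer row of the table.
TABLE_ROW = ("| 256x192 | 0.691 | 0.769 | 0.690 | 0.809 | 0.650 | 0.747 "
             "| 0.534 | 0.647 | 0.574 | 0.678 |")

# Metrics copied from the metafile entry.
METAFILE_METRICS = {
    "Body AP": 0.691, "Body AR": 0.769,
    "Face AP": 0.65, "Face AR": 0.747,
    "Foot AP": 0.69, "Foot AR": 0.809,
    "Hand AP": 0.534, "Hand AR": 0.647,
    "Whole AP": 0.574, "Whole AR": 0.678,
}


def parse_row(row, columns):
    """Return {column: float} for the metric cells after the input-size cell."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    values = cells[1:1 + len(columns)]  # skip the "Input Size" cell
    return dict(zip(columns, map(float, values)))


def find_mismatches(table, metafile, tol=1e-9):
    """List metric names that differ or are missing on either side."""
    names = sorted(set(table) | set(metafile))
    return [n for n in names
            if n not in table or n not in metafile
            or abs(table[n] - metafile[n]) > tol]


table_metrics = parse_row(TABLE_ROW, COLUMNS)
mismatches = find_mismatches(table_metrics, METAFILE_METRICS)
print("mismatches:", mismatches)
```

Matching by metric name rather than by position matters here because the metafile lists Face before Foot while the table lists Foot before Face.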
171 changes: 171 additions & 0 deletions
...dy/2d_kpt_sview_rgb_img/topdown_heatmap/coco-wholebody/tcformer_coco_wholebody_256x192.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,171 @@ | ||
_base_ = ['../../../../_base_/datasets/coco_wholebody.py'] | ||
log_level = 'INFO' | ||
load_from = None | ||
resume_from = None | ||
dist_params = dict(backend='nccl') | ||
workflow = [('train', 1)] | ||
checkpoint_config = dict(interval=10) | ||
evaluation = dict(interval=10, metric='mAP', save_best='AP') | ||
|
||
optimizer = dict( | ||
type='AdamW', | ||
lr=5e-4, | ||
betas=(0.9, 0.999), | ||
weight_decay=0.01, | ||
) | ||
|
||
optimizer_config = dict(grad_clip=None) | ||
# learning policy | ||
lr_config = dict( | ||
policy='step', | ||
warmup='linear', | ||
warmup_iters=500, | ||
warmup_ratio=0.001, | ||
step=[170, 200]) | ||
total_epochs = 210 | ||
log_config = dict( | ||
interval=50, | ||
hooks=[ | ||
dict(type='TextLoggerHook'), | ||
# dict(type='TensorboardLoggerHook') | ||
]) | ||
|
||
channel_cfg = dict( | ||
num_output_channels=133, | ||
dataset_joints=133, | ||
dataset_channel=[ | ||
list(range(133)), | ||
], | ||
inference_channel=list(range(133))) | ||
|
||
# model settings | ||
norm_cfg = dict(type='SyncBN', requires_grad=True) | ||
model = dict( | ||
type='TopDown', | ||
pretrained='https://download.openmmlab.com/mmpose/' | ||
'pretrain_models/tcformer-4e1adbf1_20220421.pth', | ||
backbone=dict( | ||
type='TCFormer', | ||
embed_dims=[64, 128, 320, 512], | ||
num_heads=[1, 2, 5, 8], | ||
mlp_ratios=[8, 8, 4, 4], | ||
qkv_bias=True, | ||
num_layers=[3, 4, 6, 3], | ||
sr_ratios=[8, 4, 2, 1], | ||
drop_path_rate=0.1), | ||
neck=dict( | ||
type='MTA', | ||
in_channels=[64, 128, 320, 512], | ||
out_channels=256, | ||
start_level=0, | ||
num_heads=[4, 4, 4, 4], | ||
mlp_ratios=[4, 4, 4, 4], | ||
num_outs=4, | ||
use_sr_conv=False, | ||
), | ||
keypoint_head=dict( | ||
type='TopdownHeatmapSimpleHead', | ||
in_channels=256, | ||
out_channels=channel_cfg['num_output_channels'], | ||
num_deconv_layers=0, | ||
extra=dict(final_conv_kernel=1, ), | ||
loss_keypoint=dict(type='JointsMSELoss', use_target_weight=True)), | ||
train_cfg=dict(), | ||
test_cfg=dict( | ||
flip_test=True, | ||
post_process='default', | ||
shift_heatmap=True, | ||
modulate_kernel=11)) | ||
|
||
data_cfg = dict( | ||
image_size=[192, 256], | ||
heatmap_size=[48, 64], | ||
num_output_channels=channel_cfg['num_output_channels'], | ||
num_joints=channel_cfg['dataset_joints'], | ||
dataset_channel=channel_cfg['dataset_channel'], | ||
inference_channel=channel_cfg['inference_channel'], | ||
soft_nms=False, | ||
nms_thr=1.0, | ||
oks_thr=0.9, | ||
vis_thr=0.2, | ||
use_gt_bbox=False, | ||
det_bbox_thr=0.0, | ||
bbox_file='data/coco/person_detection_results/' | ||
'COCO_val2017_detections_AP_H_56_person.json', | ||
) | ||
|
||
train_pipeline = [ | ||
dict(type='LoadImageFromFile'), | ||
dict(type='TopDownGetBboxCenterScale', padding=1.25), | ||
dict(type='TopDownRandomShiftBboxCenter', shift_factor=0.16, prob=0.3), | ||
dict(type='TopDownRandomFlip', flip_prob=0.5), | ||
dict( | ||
type='TopDownHalfBodyTransform', | ||
num_joints_half_body=8, | ||
prob_half_body=0.3), | ||
dict( | ||
type='TopDownGetRandomScaleRotation', rot_factor=40, scale_factor=0.5), | ||
dict(type='TopDownAffine'), | ||
dict(type='ToTensor'), | ||
dict( | ||
type='NormalizeTensor', | ||
mean=[0.485, 0.456, 0.406], | ||
std=[0.229, 0.224, 0.225]), | ||
dict(type='TopDownGenerateTarget', sigma=2), | ||
dict( | ||
type='Collect', | ||
keys=['img', 'target', 'target_weight'], | ||
meta_keys=[ | ||
'image_file', 'joints_3d', 'joints_3d_visible', 'center', 'scale', | ||
'rotation', 'bbox_score', 'flip_pairs' | ||
]), | ||
] | ||
|
||
val_pipeline = [ | ||
dict(type='LoadImageFromFile'), | ||
dict(type='TopDownGetBboxCenterScale', padding=1.25), | ||
dict(type='TopDownAffine'), | ||
dict(type='ToTensor'), | ||
dict( | ||
type='NormalizeTensor', | ||
mean=[0.485, 0.456, 0.406], | ||
std=[0.229, 0.224, 0.225]), | ||
dict( | ||
type='Collect', | ||
keys=['img'], | ||
meta_keys=[ | ||
'image_file', 'center', 'scale', 'rotation', 'bbox_score', | ||
'flip_pairs' | ||
]), | ||
] | ||
|
||
test_pipeline = val_pipeline | ||
|
||
data_root = 'data/coco' | ||
data = dict( | ||
samples_per_gpu=64, | ||
workers_per_gpu=2, | ||
val_dataloader=dict(samples_per_gpu=32), | ||
test_dataloader=dict(samples_per_gpu=32), | ||
train=dict( | ||
type='TopDownCocoWholeBodyDataset', | ||
ann_file=f'{data_root}/annotations/coco_wholebody_train_v1.0.json', | ||
img_prefix=f'{data_root}/train2017/', | ||
data_cfg=data_cfg, | ||
pipeline=train_pipeline, | ||
dataset_info={{_base_.dataset_info}}), | ||
val=dict( | ||
type='TopDownCocoWholeBodyDataset', | ||
ann_file=f'{data_root}/annotations/coco_wholebody_val_v1.0.json', | ||
img_prefix=f'{data_root}/val2017/', | ||
data_cfg=data_cfg, | ||
pipeline=val_pipeline, | ||
dataset_info={{_base_.dataset_info}}), | ||
test=dict( | ||
type='TopDownCocoWholeBodyDataset', | ||
ann_file=f'{data_root}/annotations/coco_wholebody_val_v1.0.json', | ||
img_prefix=f'{data_root}/val2017/', | ||
data_cfg=data_cfg, | ||
pipeline=val_pipeline, | ||
dataset_info={{_base_.dataset_info}}), | ||
) |
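
The `lr_config` above combines a step policy (milestones at epochs 170 and 200) with 500 iterations of linear warmup starting from 0.001 of the base rate. The sketch below shows what learning rate that would produce under two assumptions that the config does not state: a decay factor of 0.1 at each milestone, and a warmup that scales the rate linearly from `warmup_ratio * lr` up to `lr`. The function `lr_at` is a hypothetical helper for illustration, and its formula should be checked against the installed MMCV version before relying on it.

```python
def lr_at(epoch, cur_iter, base_lr=5e-4, steps=(170, 200), gamma=0.1,
          warmup_iters=500, warmup_ratio=0.001):
    """Learning rate at a given 0-based epoch and global iteration.

    gamma=0.1 is an assumption: the config does not set a decay factor.
    """
    # Step policy: multiply by gamma once for every milestone already reached.
    num_decays = sum(1 for s in steps if epoch >= s)
    regular_lr = base_lr * gamma ** num_decays

    # Linear warmup: ramp from warmup_ratio * lr up to lr over warmup_iters.
    if cur_iter < warmup_iters:
        k = (1 - cur_iter / warmup_iters) * (1 - warmup_ratio)
        return regular_lr * (1 - k)
    return regular_lr


# Print the rate at the start of warmup, mid-warmup, and around each milestone.
for epoch, it in [(0, 0), (0, 250), (0, 500), (169, 10**6),
                  (170, 10**6), (200, 10**6)]:
    print(f"epoch {epoch:3d}, iter {it:7d}: lr = {lr_at(epoch, it):.3e}")
```

Under these assumptions the rate starts near 5e-7, reaches 5e-4 after warmup, and drops to 5e-5 and then 5e-6 for the last 40 and 10 of the 210 epochs.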
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
# Not All Tokens Are Equal: Human-Centric Visual Analysis via Token Clustering Transformer | ||
|
||
<!-- [ALGORITHM] --> | ||
|
||
<details> | ||
<summary align="right"><a href="https://openaccess.thecvf.com/content/CVPR2022/html/Zeng_Not_All_Tokens_Are_Equal_Human-Centric_Visual_Analysis_via_Token_CVPR_2022_paper.html">TCFormer (CVPR'2022)</a></summary> | ||
|
||
```bibtex | ||
@inproceedings{zeng2022not, | ||
title={Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer}, | ||
author={Zeng, Wang and Jin, Sheng and Liu, Wentao and Qian, Chen and Luo, Ping and Ouyang, Wanli and Wang, Xiaogang}, | ||
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, | ||
pages={11101--11111}, | ||
year={2022} | ||
} | ||
``` | ||
|
||
</details> | ||
|
||
## Abstract | ||
|
||
<!-- [ABSTRACT] --> | ||
|
||
Vision transformers have achieved great successes in | ||
many computer vision tasks. Most methods generate | ||
vision tokens by splitting an image into a regular | ||
and fixed grid and treating each cell as a token. | ||
However, not all regions are equally important in | ||
human-centric vision tasks, e.g., the human body | ||
needs a fine representation with many tokens, while | ||
the image background can be modeled by a few tokens. | ||
To address this problem, we propose a novel Vision | ||
Transformer, called Token Clustering Transformer | ||
(TCFormer), which merges tokens by progressive | ||
clustering, where the tokens can be merged from | ||
different locations with flexible shapes and sizes. | ||
The tokens in TCFormer can not only focus on important | ||
areas but also adjust the token shapes to fit the | ||
semantic concept and adopt a fine resolution for | ||
regions containing critical details, which is | ||
beneficial to capturing detailed information. | ||
Extensive experiments show that TCFormer consistently | ||
outperforms its counterparts on different challenging | ||
human-centric tasks and datasets, including whole-body | ||
pose estimation on COCO-WholeBody and 3D human mesh | ||
reconstruction on 3DPW. Code is available at | ||
https://github.com/zengwang430521/TCFormer.git. | ||
|
||
<!-- [IMAGE] --> | ||
|
||
<div align=center> | ||
<img src="https://user-images.githubusercontent.com/28900607/175868010-b408e0dc-768c-4fb9-9095-5874fcb42d7f.png"> | ||
</div> |
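
The abstract describes merging tokens by clustering, so that background regions collapse into a few tokens while detailed regions keep many. The sketch below illustrates only the merge step, assuming the cluster assignments are already given, and simply averages the features within each cluster. It is not the TCFormer implementation: the clustering algorithm and any token weighting used in the paper are not reproduced here, and `merge_tokens` is a hypothetical helper.

```python
from collections import defaultdict


def merge_tokens(tokens, cluster_ids):
    """Average the feature vectors of tokens that share a cluster id.

    tokens: list of equal-length feature vectors (lists of floats).
    cluster_ids: one integer cluster label per token.
    Returns {cluster_id: merged feature vector}.
    """
    if len(tokens) != len(cluster_ids):
        raise ValueError("tokens and cluster_ids must have the same length")

    # Group token features by their cluster label.
    groups = defaultdict(list)
    for token, cid in zip(tokens, cluster_ids):
        groups[cid].append(token)

    # Replace each group by the element-wise mean of its members.
    merged = {}
    for cid, members in groups.items():
        dim = len(members[0])
        merged[cid] = [sum(m[d] for m in members) / len(members)
                       for d in range(dim)]
    return merged


# Six grid tokens: four "background" tokens collapse into one cluster,
# while two "foreground" tokens each keep their own cluster.
tokens = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2],
          [1.0, 5.0], [3.0, 7.0]]
cluster_ids = [0, 0, 0, 0, 1, 2]
merged = merge_tokens(tokens, cluster_ids)
print(len(tokens), "tokens ->", len(merged), "merged tokens")
```

Because cluster membership is arbitrary, a merged token can cover a region of any shape, which is the property the abstract contrasts with a fixed grid.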