[Doc] Add related docs for PoseWarper (#1036)
* Add related docs for PoseWarper
* Add related README docs for PoseWarper
* Modify related args in the PoseWarper stage 2 config
* Modify the PoseWarper stage 2 config path
Showing 9 changed files with 566 additions and 0 deletions.
@@ -0,0 +1,9 @@
# Video-based Single-view 2D Human Body Pose Estimation

Multi-person 2D human pose estimation in video is the task of detecting the poses (or keypoints) of all people in an input video.

For this task, we currently support [PoseWarper](/configs/body/2d_kpt_sview_rgb_vid/posewarper).

## Data preparation

Please follow [DATA Preparation](/docs/tasks/2d_body_keypoint.md) to prepare data.
@@ -0,0 +1,25 @@
# Learning Temporal Pose Estimation from Sparsely-Labeled Videos

<!-- [ALGORITHM] -->

<details>
<summary align="right"><a href="https://arxiv.org/abs/1906.04016">PoseWarper (NeurIPS'2019)</a></summary>

```bibtex
@inproceedings{NIPS2019_gberta,
  title = {Learning Temporal Pose Estimation from Sparsely Labeled Videos},
  author = {Bertasius, Gedas and Feichtenhofer, Christoph and Tran, Du and Shi, Jianbo and Torresani, Lorenzo},
  booktitle = {Advances in Neural Information Processing Systems 33},
  year = {2019},
}
```

</details>

PoseWarper proposes a network that leverages training videos with sparse annotations (every k frames) to learn dense temporal pose propagation and estimation. Given a pair of video frames, a labeled Frame A and an unlabeled Frame B, the model is trained to predict the human pose in Frame A using the features from Frame B, with deformable convolutions implicitly learning the pose warping between A and B.

The training of PoseWarper can be split into two stages.

In the first stage, the model is initialized from a pre-trained checkpoint and the main backbone is fine-tuned in a single-frame setting.

In the second stage, the model is initialized from the first-stage model, and the warping offsets are learned in a multi-frame setting while the backbone is frozen.
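The warping mechanism is easiest to see in code. Below is a minimal sketch of a single warping step, assuming torchvision's `DeformConv2d`; the feature shapes and the `offset_head` layer are illustrative placeholders, not the actual PoseWarper architecture (which predicts offsets from several branches with different dilation rates).

```python
# Minimal sketch of deformable warping between two frames. This is an
# illustration of the idea, not the PoseWarper implementation.
import torch
from torchvision.ops import DeformConv2d

feat_a = torch.randn(1, 48, 96, 72)  # backbone features of labeled Frame A
feat_b = torch.randn(1, 48, 96, 72)  # backbone features of unlabeled Frame B

# Predict per-location sampling offsets from the difference of the two
# feature maps: 2 * 3 * 3 = 18 channels give one (dy, dx) pair for every
# position of a 3x3 kernel.
offset_head = torch.nn.Conv2d(48, 18, kernel_size=3, padding=1)
warp = DeformConv2d(48, 48, kernel_size=3, padding=1)

offsets = offset_head(feat_a - feat_b)
warped = warp(feat_b, offsets)  # Frame B features warped toward Frame A
```

Pose in Frame A can then be estimated from the warped Frame B features, which is what lets sparsely labeled frames supervise the offsets.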
...ody/2d_kpt_sview_rgb_vid/posewarper/posetrack18/hrnet_posetrack18_posewarper.md (86 additions, 0 deletions)
@@ -0,0 +1,86 @@
<!-- [ALGORITHM] -->

<details>
<summary align="right"><a href="https://arxiv.org/abs/1906.04016">PoseWarper (NeurIPS'2019)</a></summary>

```bibtex
@inproceedings{NIPS2019_gberta,
  title = {Learning Temporal Pose Estimation from Sparsely Labeled Videos},
  author = {Bertasius, Gedas and Feichtenhofer, Christoph and Tran, Du and Shi, Jianbo and Torresani, Lorenzo},
  booktitle = {Advances in Neural Information Processing Systems 33},
  year = {2019},
}
```

</details>

<!-- [ALGORITHM] -->

<details>
<summary align="right"><a href="http://openaccess.thecvf.com/content_CVPR_2019/html/Sun_Deep_High-Resolution_Representation_Learning_for_Human_Pose_Estimation_CVPR_2019_paper.html">HRNet (CVPR'2019)</a></summary>

```bibtex
@inproceedings{sun2019deep,
  title={Deep high-resolution representation learning for human pose estimation},
  author={Sun, Ke and Xiao, Bin and Liu, Dong and Wang, Jingdong},
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages={5693--5703},
  year={2019}
}
```

</details>

<!-- [DATASET] -->

<details>
<summary align="right"><a href="http://openaccess.thecvf.com/content_cvpr_2018/html/Andriluka_PoseTrack_A_Benchmark_CVPR_2018_paper.html">PoseTrack18 (CVPR'2018)</a></summary>

```bibtex
@inproceedings{andriluka2018posetrack,
  title={Posetrack: A benchmark for human pose estimation and tracking},
  author={Andriluka, Mykhaylo and Iqbal, Umar and Insafutdinov, Eldar and Pishchulin, Leonid and Milan, Anton and Gall, Juergen and Schiele, Bernt},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={5167--5176},
  year={2018}
}
```

</details>

<!-- [DATASET] -->

<details>
<summary align="right"><a href="https://link.springer.com/chapter/10.1007/978-3-319-10602-1_48">COCO (ECCV'2014)</a></summary>

```bibtex
@inproceedings{lin2014microsoft,
  title={Microsoft coco: Common objects in context},
  author={Lin, Tsung-Yi and Maire, Michael and Belongie, Serge and Hays, James and Perona, Pietro and Ramanan, Deva and Doll{\'a}r, Piotr and Zitnick, C Lawrence},
  booktitle={European conference on computer vision},
  pages={740--755},
  year={2014},
  organization={Springer}
}
```

</details>
Note that the training of PoseWarper can be split into two stages.

In the first stage, the model is initialized from the [checkpoint](https://download.openmmlab.com/mmpose/top_down/hrnet/hrnet_w48_coco_384x288-314c8528_20200708.pth) pre-trained on the COCO dataset, and the main backbone is fine-tuned on PoseTrack18 in a single-frame setting.

In the second stage, the model is initialized from the final first-stage [checkpoint](https://download.openmmlab.com/mmpose/top_down/posewarper/hrnet_w48_posetrack18_384x288_posewarper_stage1-08b632aa_20211130.pth), and the warping offsets are learned in a multi-frame setting while the backbone is frozen. A sketch of wiring the two stages together follows.
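For example, stage 2 can pick up the stage-1 weights through the `load_from` field. This is a minimal sketch assuming the mmcv `Config` API; the released stage-2 config is expected to point at this checkpoint already, so the explicit override here is only illustrative.

```python
# Sketch: start stage 2 from the stage-1 checkpoint via `load_from`.
from mmcv import Config

cfg = Config.fromfile(
    'configs/body/2d_kpt_sview_rgb_vid/posewarper/posetrack18/'
    'hrnet_w48_posetrack18_384x288_posewarper_stage2.py')
cfg.load_from = (
    'https://download.openmmlab.com/mmpose/top_down/posewarper/'
    'hrnet_w48_posetrack18_384x288_posewarper_stage1-08b632aa_20211130.pth')
```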
Results on PoseTrack2018 val with ground-truth bounding boxes

| Arch | Input Size | Head | Shou | Elb | Wri | Hip | Knee | Ankl | Total | ckpt | log |
| :--- | :--------: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :---: | :--: | :--: |
| [pose_hrnet_w48](/configs/body/2d_kpt_sview_rgb_vid/posewarper/posetrack18/hrnet_w48_posetrack18_384x288_posewarper_stage2.py) | 384x288 | 88.2 | 90.3 | 86.1 | 81.6 | 81.8 | 83.8 | 81.5 | 85.0 | [ckpt](https://download.openmmlab.com/mmpose/top_down/posewarper/hrnet_w48_posetrack18_384x288_posewarper_stage2-4abf88db_20211130.pth) | [log](https://download.openmmlab.com/mmpose/top_down/posewarper/hrnet_w48_posetrack18_384x288_posewarper_stage2_20211130.log.json) |

Results on PoseTrack2018 val with precomputed human bounding boxes from the PoseWarper supplementary data files ([download link](https://www.dropbox.com/s/ygfy6r8nitoggfq/PoseWarper_supp_files.zip?dl=0))

| Arch | Input Size | Head | Shou | Elb | Wri | Hip | Knee | Ankl | Total | ckpt | log |
| :--- | :--------: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :---: | :--: | :--: |
| [pose_hrnet_w48](/configs/body/2d_kpt_sview_rgb_vid/posewarper/posetrack18/hrnet_w48_posetrack18_384x288_posewarper_stage2.py) | 384x288 | 81.8 | 85.6 | 82.7 | 77.2 | 76.8 | 79.0 | 74.4 | 79.8 | [ckpt](https://download.openmmlab.com/mmpose/top_down/posewarper/hrnet_w48_posetrack18_384x288_posewarper_stage2-4abf88db_20211130.pth) | [log](https://download.openmmlab.com/mmpose/top_down/posewarper/hrnet_w48_posetrack18_384x288_posewarper_stage2_20211130.log.json) |
configs/body/2d_kpt_sview_rgb_vid/posewarper/posetrack18/hrnet_posetrack18_posewarper.yml (48 additions, 0 deletions)
@@ -0,0 +1,48 @@
Collections:
- Name: PoseWarper
  Paper:
    Title: Learning Temporal Pose Estimation from Sparsely Labeled Videos
    URL: https://arxiv.org/abs/1906.04016
Models:
- Config: configs/body/2d_kpt_sview_rgb_vid/posewarper/posetrack18/hrnet_w48_posetrack18_384x288_posewarper_stage2.py
  In Collection: PoseWarper
  Metadata:
    Architecture: &id001
    - PoseWarper
    - HRNet
    Training Data: PoseTrack18
  Name: posewarper_hrnet_w48_posetrack18_384x288_posewarper_stage2
  README: configs/body/2d_kpt_sview_rgb_vid/posewarper/posetrack18/hrnet_posetrack18_posewarper.md
  Results:
  - Dataset: PoseTrack18
    Metrics:
      Ankl: 81.5
      Elb: 86.1
      Head: 88.2
      Hip: 81.8
      Knee: 83.8
      Shou: 90.3
      Total: 85.0
      Wri: 81.6
    Task: Body 2D Keypoint
  Weights: https://download.openmmlab.com/mmpose/top_down/posewarper/hrnet_w48_posetrack18_384x288_posewarper_stage2-4abf88db_20211130.pth
- Config: configs/body/2d_kpt_sview_rgb_vid/posewarper/posetrack18/hrnet_w48_posetrack18_384x288_posewarper_stage2.py
  In Collection: PoseWarper
  Metadata:
    Architecture: *id001
    Training Data: PoseTrack18
  Name: posewarper_hrnet_w48_posetrack18_384x288_posewarper_stage2
  README: configs/body/2d_kpt_sview_rgb_vid/posewarper/posetrack18/hrnet_posetrack18_posewarper.md
  Results:
  - Dataset: PoseTrack18
    Metrics:
      Ankl: 74.4
      Elb: 82.7
      Head: 81.8
      Hip: 76.8
      Knee: 79.0
      Shou: 85.6
      Total: 79.8
      Wri: 77.2
    Task: Body 2D Keypoint
  Weights: https://download.openmmlab.com/mmpose/top_down/posewarper/hrnet_w48_posetrack18_384x288_posewarper_stage2-4abf88db_20211130.pth
...t_sview_rgb_vid/posewarper/posetrack18/hrnet_w48_posetrack18_384x288_posewarper_stage1.py (174 additions, 0 deletions)
@@ -0,0 +1,174 @@
_base_ = ['../../../../_base_/datasets/posetrack18.py']
log_level = 'INFO'
load_from = 'https://download.openmmlab.com/mmpose/top_down/hrnet/hrnet_w48_coco_384x288-314c8528_20200708.pth'  # noqa: E501
resume_from = None
dist_params = dict(backend='nccl')
cudnn_benchmark = True
workflow = [('train', 1)]
checkpoint_config = dict(interval=1)
evaluation = dict(interval=1, metric='mAP', save_best='Total AP')

optimizer = dict(
    type='Adam',
    lr=0.0001,
)
optimizer_config = dict(grad_clip=None)
# learning policy
lr_config = dict(policy='step', step=[5, 7])
total_epochs = 10
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
        # dict(type='TensorboardLoggerHook')
    ])

channel_cfg = dict(
    num_output_channels=17,
    dataset_joints=17,
    dataset_channel=[
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
    ],
    inference_channel=[
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
    ])

# model settings
model = dict(
    type='TopDown',
    pretrained=None,
    backbone=dict(
        type='HRNet',
        in_channels=3,
        extra=dict(
            stage1=dict(
                num_modules=1,
                num_branches=1,
                block='BOTTLENECK',
                num_blocks=(4, ),
                num_channels=(64, )),
            stage2=dict(
                num_modules=1,
                num_branches=2,
                block='BASIC',
                num_blocks=(4, 4),
                num_channels=(48, 96)),
            stage3=dict(
                num_modules=4,
                num_branches=3,
                block='BASIC',
                num_blocks=(4, 4, 4),
                num_channels=(48, 96, 192)),
            stage4=dict(
                num_modules=3,
                num_branches=4,
                block='BASIC',
                num_blocks=(4, 4, 4, 4),
                num_channels=(48, 96, 192, 384))),
    ),
    keypoint_head=dict(
        type='TopdownHeatmapSimpleHead',
        in_channels=48,
        out_channels=channel_cfg['num_output_channels'],
        num_deconv_layers=0,
        extra=dict(final_conv_kernel=1, ),
        loss_keypoint=dict(type='JointsMSELoss', use_target_weight=True)),
    train_cfg=dict(),
    test_cfg=dict(
        flip_test=True,
        post_process='default',
        shift_heatmap=True,
        modulate_kernel=11))
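# Note: with num_deconv_layers=0 and final_conv_kernel=1, the keypoint head
# reduces to a single 1x1 convolution mapping HRNet's 48-channel
# high-resolution features to the 17 keypoint heatmaps.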

data_cfg = dict(
    image_size=[288, 384],
    heatmap_size=[72, 96],
    num_output_channels=channel_cfg['num_output_channels'],
    num_joints=channel_cfg['dataset_joints'],
    dataset_channel=channel_cfg['dataset_channel'],
    inference_channel=channel_cfg['inference_channel'],
    soft_nms=False,
    nms_thr=1.0,
    oks_thr=0.9,
    vis_thr=0.2,
    use_gt_bbox=True,
    det_bbox_thr=0.2,
    bbox_file='data/posetrack18/annotations/'
    'posetrack18_val_human_detections.json',
)
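# Note: heatmap_size is image_size / 4 (a 288x384 crop yields 72x96 targets),
# matching the output stride of HRNet's highest-resolution branch.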

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='TopDownHalfBodyTransform',
        num_joints_half_body=8,
        prob_half_body=0.3),
    dict(
        type='TopDownGetRandomScaleRotation', rot_factor=45,
        scale_factor=0.35),
    dict(type='TopDownRandomFlip', flip_prob=0.5),
    dict(type='TopDownAffine'),
    dict(type='ToTensor'),
    dict(
        type='NormalizeTensor',
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]),
    dict(type='TopDownGenerateTarget', sigma=3),
    dict(
        type='Collect',
        keys=['img', 'target', 'target_weight'],
        meta_keys=[
            'image_file', 'joints_3d', 'joints_3d_visible', 'center', 'scale',
            'rotation', 'bbox_score', 'flip_pairs'
        ]),
]

val_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='TopDownAffine'),
    dict(type='ToTensor'),
    dict(
        type='NormalizeTensor',
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]),
    dict(
        type='Collect',
        keys=[
            'img',
        ],
        meta_keys=[
            'image_file', 'center', 'scale', 'rotation', 'bbox_score',
            'flip_pairs'
        ]),
]

test_pipeline = val_pipeline

data_root = 'data/posetrack18'
data = dict(
    samples_per_gpu=16,
    workers_per_gpu=3,
    val_dataloader=dict(samples_per_gpu=16),
    test_dataloader=dict(samples_per_gpu=16),
    train=dict(
        type='TopDownPoseTrack18Dataset',
        ann_file=f'{data_root}/annotations/posetrack18_train.json',
        img_prefix=f'{data_root}/',
        data_cfg=data_cfg,
        pipeline=train_pipeline,
        dataset_info={{_base_.dataset_info}}),
    val=dict(
        type='TopDownPoseTrack18Dataset',
        ann_file=f'{data_root}/annotations/posetrack18_val.json',
        img_prefix=f'{data_root}/',
        data_cfg=data_cfg,
        pipeline=val_pipeline,
        dataset_info={{_base_.dataset_info}}),
    test=dict(
        type='TopDownPoseTrack18Dataset',
        ann_file=f'{data_root}/annotations/posetrack18_val.json',
        img_prefix=f'{data_root}/',
        data_cfg=data_cfg,
        pipeline=test_pipeline,
        dataset_info={{_base_.dataset_info}}),
)
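The `TopDownGenerateTarget` step in the training pipeline above encodes each annotated keypoint as a Gaussian peak at heatmap resolution. Below is a minimal sketch of that encoding using the config's sizes and `sigma=3`; it is a simplified stand-in for the mmpose transform, which additionally handles per-joint visibility weights and alternative encoding options.

```python
import numpy as np

def gaussian_heatmap(joint_xy, heatmap_size=(72, 96), image_size=(288, 384),
                     sigma=3):
    """Render one keypoint as a Gaussian peak at heatmap resolution.

    Sizes are (width, height), as in the config above. Simplified sketch,
    not the actual mmpose implementation.
    """
    w, h = heatmap_size
    stride = image_size[0] / w  # 288 / 72 = 4
    mu_x, mu_y = joint_xy[0] / stride, joint_xy[1] / stride
    xs = np.arange(w, dtype=np.float32)            # (72,)
    ys = np.arange(h, dtype=np.float32)[:, None]   # (96, 1)
    return np.exp(-((xs - mu_x) ** 2 + (ys - mu_y) ** 2) / (2 * sigma ** 2))

# e.g. a wrist annotated at pixel (150, 200) in the 288x384 input crop:
heatmap = gaussian_heatmap((150, 200))
print(heatmap.shape)  # (96, 72): one (h, w) target map per keypoint
```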