This is an official implementation of the paper "Group Contextualization for Video Recognition", which has been accepted by CVPR 2022. Paper link
- Released this V1 version (the version used in the paper) to the public.
The code is built with the following libraries:
- PyTorch >= 1.7, torchvision
- tensorboardx
For video data pre-processing, you may need ffmpeg.
For GC-TSN, GC-GST, and GC-TSM, videos must first be extracted into frames for all datasets (Kinetics-400, Something-Something V1 and V2, Diving48, and EGTEA Gaze+), following the TSN repo. For GC-TDN, data processing follows the backbone TDN work: the short edge of each video is resized to 320px, and the mp4 files are decoded directly during training/evaluation. A frame-extraction sketch is given below.
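As a minimal sketch of the frame-extraction step (assuming a TSN-style layout with JPEG frames named img_%05d.jpg; the helper below is illustrative and not part of this repo):

```python
import subprocess
from pathlib import Path

def extract_frames(video: Path, out_dir: Path, short_side: int = 256) -> None:
    """Decode one video into JPEG frames with ffmpeg.

    `short_side=256` is a common TSN-style choice, not mandated by this README;
    for GC-TDN the videos themselves are resized to a 320px short edge and
    decoded directly instead of being dumped to frames.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    # Keep the aspect ratio while forcing the shorter edge to `short_side`.
    vf = (f"scale='if(gt(iw,ih),-2,{short_side})'"
          f":'if(gt(iw,ih),{short_side},-2)'")
    subprocess.run(
        ["ffmpeg", "-i", str(video), "-vf", vf, "-q:v", "2",
         str(out_dir / "img_%05d.jpg")],
        check=True,
    )
```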
The GC-TSN, GC-TSM, GC-GST, and GC-TDN code is based on the TSN, TSM, GST, and TDN codebases, respectively.
Here we provide some of the pretrained models. Results on Kinetics-400:
Model | Frame * view * clip | Top-1 Acc. | Top-5 Acc. | Checkpoint |
---|---|---|---|---|
GC-TSN ResNet50 | 8 * 1 * 10 | 75.2% | 92.1% | link |
GC-TSM ResNet50 | 8 * 1 * 10 | 75.4% | 91.9% | link |
GC-TSM ResNet50 | 16 * 1 * 10 | 76.7% | 92.9% | link |
GC-TSM ResNet50 | 16 * 3 * 10 | 77.1% | 92.9% | |
GC-TDN ResNet50 | 8 * 3 * 10 | 77.3% | 93.2% | link |
GC-TDN ResNet50 | 16 * 3 * 10 | 78.8% | 93.8% | link |
GC-TDN ResNet50 | (8+16) * 3 * 10 | 79.6% | 94.1% | |
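The "Frame * view * clip" column denotes the multi-view test protocol: frames sampled per clip × spatial views (crops) × temporal clips per video, with the per-view scores fused into one prediction. A minimal sketch of the usual fusion (assuming softmax-then-average, which is common practice but not spelled out in this README):

```python
import torch

def fuse_views(logits: torch.Tensor) -> torch.Tensor:
    """Fuse per-view class scores for one video.

    logits: (num_views * num_clips, num_classes) raw scores from the network.
    Returns a single (num_classes,) score vector used for top-1/top-5 accuracy.
    """
    return logits.softmax(dim=1).mean(dim=0)

# "8 * 3 * 10" means 8 frames per clip, 3 spatial views, 10 temporal clips:
# 3 * 10 = 30 forward passes per video on Kinetics-400 (400 classes).
scores = fuse_views(torch.randn(3 * 10, 400))
top5 = scores.topk(5).indices
```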
Something-Something V1 and V2 are highly temporal-related datasets. We report performance at 224×224 resolution. Results on Something-Something V1:
Model | Frame * view * clip | Top-1 Acc. | Top-5 Acc. | Checkpoint |
---|---|---|---|---|
GC-GST ResNet50 | 8 * 1 * 2 | 48.8% | 78.5% | link |
GC-GST ResNet50 | 16 * 1 * 2 | 50.4% | 79.4% | link |
GC-GST ResNet50 | (8+16) * 1 * 2 | 52.5% | 81.3% | |
GC-TSN ResNet50 | 8 * 1 * 2 | 49.7% | 78.2% | link |
GC-TSN ResNet50 | 16 * 1 * 2 | 51.3% | 80.0% | link |
GC-TSN ResNet50 | (8+16) * 1 * 2 | 53.7% | 81.8% | |
GC-TSM ResNet50 | 8 * 1 * 2 | 51.1% | 79.4% | link |
GC-TSM ResNet50 | 16 * 1 * 2 | 53.1% | 81.2% | link |
GC-TSM ResNet50 | (8+16) * 1 * 2 | 55.0% | 82.6% | |
GC-TSM ResNet50 | (8+16) * 3 * 2 | 55.3% | 82.7% | |
GC-TDN ResNet50 | 8 * 1 * 1 | 53.7% | 82.2% | link |
GC-TDN ResNet50 | 16 * 1 * 1 | 55.0% | 82.3% | link |
GC-TDN ResNet50 | (8+16) * 1 * 1 | 56.4% | 84.0% | |
Results on Something-Something V2:
Model | Frame * view * clip | Top-1 Acc. | Top-5 Acc. | Checkpoint |
---|---|---|---|---|
GC-GST ResNet50 | 8 * 1 * 2 | 61.9% | 87.8% | link |
GC-GST ResNet50 | 16 * 1 * 2 | 63.3% | 88.5% | link |
GC-GST ResNet50 | (8+16) * 1 * 2 | 65.0% | 89.5% | |
GC-TSN ResNet50 | 8 * 1 * 2 | 62.4% | 87.9% | link |
GC-TSN ResNet50 | 16 * 1 * 2 | 64.8% | 89.4% | link |
GC-TSN ResNet50 | (8+16) * 1 * 2 | 66.3% | 90.3% | |
GC-TSM ResNet50 | 8 * 1 * 2 | 63.0% | 88.4% | link |
GC-TSM ResNet50 | 16 * 1 * 2 | 64.9% | 89.7% | link |
GC-TSM ResNet50 | (8+16) * 1 * 2 | 66.7% | 90.6% | |
GC-TSM ResNet50 | (8+16) * 3 * 2 | 67.5% | 90.9% | |
GC-TDN ResNet50 | 8 * 1 * 1 | 64.9% | 89.7% | link |
GC-TDN ResNet50 | 16 * 1 * 1 | 65.9% | 90.0% | link |
GC-TDN ResNet50 | (8+16) * 1 * 1 | 67.8% | 91.2% | |
Results on Diving48:
Model | Frame * view * clip | Top-1 Acc. | Checkpoint |
---|---|---|---|
GC-GST ResNet50 | 16 * 1 * 1 | 82.5% | link |
GC-TSN ResNet50 | 16 * 1 * 1 | 86.8% | link |
GC-TSM ResNet50 | 16 * 1 * 1 | 87.2% | link |
GC-TDN ResNet50 | 16 * 1 * 1 | 87.6% | link |
Results on EGTEA Gaze+ (Top-1 accuracy on the three splits):
Model | Frame * view * clip | Split1 | Split2 | Split3 |
---|---|---|---|---|
GC-GST ResNet50 | 8 * 1 * 1 | 65.5% | 61.6% | 60.6% |
GC-TSN ResNet50 | 8 * 1 * 1 | 66.4% | 64.6% | 61.4% |
GC-TSM ResNet50 | 8 * 1 * 1 | 66.5% | 66.1% | 62.6% |
GC-TDN ResNet50 | 8 * 1 * 1 | 65.0% | 61.8% | 61.0% |
For the different backbones, please use the corresponding training script, e.g., 'train_tsn.sh' for GC-TSN (used in the same way as in the original TSN codebase).
For the TSN/TSM/GST backbones, use the test script test_models_tsntsmgst_gc.py by running 'sh bash_test_tsntsmgst_gc.sh'. Change the import "from ops_tsntsmgst.models_tsn import VideoNet" (line 19 of test_models_tsntsmgst_gc.py) to the model you are testing, as sketched below.
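For instance (only models_tsn is confirmed by this README; the models_tsm and models_gst module names below follow the same naming pattern and are assumptions, so check the ops_tsntsmgst folder for the exact file names):

```python
# Line 19 of test_models_tsntsmgst_gc.py: keep exactly one import, matching
# the backbone under test.
from ops_tsntsmgst.models_tsn import VideoNet    # GC-TSN
# from ops_tsntsmgst.models_tsm import VideoNet  # GC-TSM (assumed name)
# from ops_tsntsmgst.models_gst import VideoNet  # GC-GST (assumed name)
```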
For the TDN backbone, please use its official test file; see https://github.com/MCG-NJU/TDN.
The GC code is jointly written and owned by Dr. Yanbin Hao and Dr. Hao Zhang. If you find this work useful, please cite:
@inproceedings{gc2022,
  title={Group Contextualization for Video Recognition},
  author={Hao, Yanbin and Zhang, Hao and Ngo, Chong-Wah and He, Xiangnan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022},
}
Thanks to the following GitHub projects: