We mainly focus on training ViT models for action recognition on Something-Something-V2 (SSv2) because it emphasizes understanding multi-frame actions rather than single-frame semantics. Our codebase is built upon VideoMAE. Many thanks to the authors of these datasets and codebases! If you have questions about our action recognition part, checking their original repositories will also be helpful.
- Step 1: Download the dataset from SSv2's official website.
- Step 2: As mentioned in VideoMAE, you should (1) preprocess SSv2 into `mp4` format with the height resized to 240 pixels, and (2) download their `train.csv` and `val.csv` from VideoMAE's Google Drive. If you are confused by the above steps (we were too), check out the solution provided in this issue, which gives a more detailed guide for data preprocessing; we followed the same procedure. A rough resizing sketch is given after this list.
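For reference, here is a minimal sketch of the kind of resizing step described above, assuming `ffmpeg` is installed and that the raw SSv2 videos are `.webm` files. The input and output directories are placeholders; the issue linked above remains the authoritative preprocessing guide.

```python
# Rough sketch (not the official preprocessing script): convert raw SSv2 .webm
# videos to .mp4 with the height resized to 240 pixels, keeping the aspect ratio.
# Assumes ffmpeg is on PATH; the directories below are placeholders.
import subprocess
from pathlib import Path

src_dir = Path("ssv2/raw")        # hypothetical location of the raw .webm files
dst_dir = Path("ssv2/mp4_240")    # hypothetical output directory
dst_dir.mkdir(parents=True, exist_ok=True)

for webm in src_dir.glob("*.webm"):
    dst = dst_dir / (webm.stem + ".mp4")
    # scale=-2:240 resizes the height to 240 and keeps the width divisible by 2
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(webm),
         "-vf", "scale=-2:240", "-c:v", "libx264", str(dst)],
        check=True,
    )
```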
We basically follow VideoMAE's guide for installation. The most important dependencies are summarized below for your convenience (a quick environment check sketch follows the list):
- PyTorch >= 1.8.0
- `timm == 0.4.12`
- `decord` (for decoding videos on-the-fly in the dataloaders)
- `einops`
- We were unable to use `deepspeed` on our server and made the corresponding changes in our code. If you want to scale up our method, please check the original VideoMAE for integration with `deepspeed`.
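As a quick sanity check (not part of the original codebase), the following sketch verifies that the dependencies above are importable and that `timm` matches the pinned version:

```python
# Sanity-check sketch for the environment described above: confirms the pinned
# timm version and that the video decoding dependencies can be imported.
import torch
import timm
import decord
import einops

assert timm.__version__ == "0.4.12", f"expected timm 0.4.12, got {timm.__version__}"
print("torch:", torch.__version__)
print("decord:", decord.__version__)
print("einops:", einops.__version__)
print("CUDA available:", torch.cuda.is_available())
```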
Before proceeding, please make sure you have downloaded the checkpoint for LLaMA-7B from LLaMA-v1 (link).
- Downloading checkpoints. We finetune the models from checkpoints pretrained by VideoMAE in a self-supervised, MAE-style way. Follow their instructions to download the checkpoints and put them in `./checkpoints/`.
- Training. Use the scripts in `./scripts/` to run training, e.g., `ssv2_vitb_llama.sh`. If you want to train the models with LLaMA, make sure the `--llama_path` option points to the directory of your LLaMA-7B checkpoint. The directory should contain files such as `checklist.chk`, `consolidated.00.pth`, and `params.json` (see the path-check sketch after this list).
- Evaluation. The training script automatically runs evaluation and reports the results at the end of the logs. To evaluate a checkpoint separately, add `--eval` to the training script and use `--resume` to point to the checkpoint you would like to evaluate.
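Before launching a long run, it can help to confirm that the directory passed to `--llama_path` is complete. The sketch below is not part of the released code; the directory name is a placeholder and the required file names are the ones listed above.

```python
# Sketch: verify that --llama_path points to a complete LLaMA-7B checkpoint
# directory before starting training. The directory name is a placeholder.
from pathlib import Path

llama_path = Path("./llama-7b")  # replace with the value you pass to --llama_path
required = ["checklist.chk", "consolidated.00.pth", "params.json"]

missing = [f for f in required if not (llama_path / f).is_file()]
if missing:
    raise FileNotFoundError(f"{llama_path} is missing: {', '.join(missing)}")
print("LLaMA-7B checkpoint directory looks complete.")
```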
| Model | Checkpoint | Acc1 | Acc5 |
|---|---|---|---|
| ViT-S | [log] / [model] | 64.71 | 89.15 |
| ViT-S-LLaMA | [log] / [model] | 65.88 | 89.93 |
| ViT-B | [log] / [model] | 64.97 | 89.50 |
| ViT-B-LLaMA | [log] / [model] | 66.03 | 90.25 |
The modifications to the video models are quite similar to those for image classification.
- In `llama.py`, we rewrite LLaMA's code by removing the positional embeddings and auto-regressive attention masks.
- The main modeling of ViT-LLaMA is in `vit_llama.py`. The initialization and forward pass are straightforward:
# initialization
...
# Frozen LLaMA blocks: all LLaMA parameters are kept fixed during finetuning.
self.llama = LLaMATransformer(llama_configs)
for param in self.llama.parameters():
    param.requires_grad = False
# Linear projections between the ViT embedding dim and LLaMA's hidden dim (4096).
self.llama_dim_mapper1 = nn.Linear(embed_dim, 4096, bias=False)
self.llama_dim_mapper2 = nn.Linear(4096, embed_dim, bias=False)
...
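For intuition, here is a hedged sketch of how the forward pass could use these modules, assuming `LLaMATransformer` takes token embeddings directly (positional embeddings and causal masks are removed in `llama.py`). The actual forward in `vit_llama.py` may differ; the name `forward_llama` is illustrative.

```python
# Sketch (assumed, not copied from vit_llama.py): route ViT video tokens through
# the frozen LLaMA blocks using the two linear mappers defined in __init__.
def forward_llama(self, x):
    # x: (batch, num_tokens, embed_dim) tokens from the ViT encoder
    x = self.llama_dim_mapper1(x)   # embed_dim -> 4096 (LLaMA hidden size)
    x = self.llama(x)               # frozen LLaMA transformer blocks
    x = self.llama_dim_mapper2(x)   # 4096 -> embed_dim
    return x
```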