We mainly focus on training ViT models for action recognition on Something-Something-V2 (SSv2) because it emphasizes understanding multi-frame actions rather than single-frame semantics. Our codebase is built upon VideoMAE. Many thanks to the authors of these datasets and codebases! If you have questions about our action recognition part, checking their original repositories will also be helpful.
- Step 1: Download the dataset from SSv2's official website.
- Step 2: As mentioned in VideoMAE, you should (1) preprocess SSv2 into `mp4` format with the height resized to 240 pixels, and (2) download their `train.csv` and `val.csv` from VideoMAE's Google Drive. If you are confused by the above steps (we were too), check out the solution provided in this issue, which gives a more detailed guide for data preprocessing; we followed the same procedure. A rough resizing sketch is given after this list.
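For reference, here is a minimal sketch of the kind of resizing step described above, assuming `ffmpeg` is installed and that the raw SSv2 videos are `.webm` files. The input and output directories are placeholders; the issue linked above remains the authoritative preprocessing guide.

```python
# Rough sketch (not the official preprocessing script): convert raw SSv2 .webm
# videos to .mp4 with the height resized to 240 pixels, keeping the aspect ratio.
# Assumes ffmpeg is on PATH; the directories below are placeholders.
import subprocess
from pathlib import Path

src_dir = Path("ssv2/raw")        # hypothetical location of the raw .webm files
dst_dir = Path("ssv2/mp4_240")    # hypothetical output directory
dst_dir.mkdir(parents=True, exist_ok=True)

for webm in src_dir.glob("*.webm"):
    dst = dst_dir / (webm.stem + ".mp4")
    # scale=-2:240 resizes the height to 240 and keeps the width divisible by 2
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(webm),
         "-vf", "scale=-2:240", "-c:v", "libx264", str(dst)],
        check=True,
    )
```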
We basically follow VideoMAE's guide for installation. The most important dependencies are summarized below for your convenience (a quick environment check sketch follows the list):
- PyTorch >= 1.8.0
- `timm == 0.4.12`
- `decord` (for decoding videos on-the-fly in the dataloaders)
- `einops`
- We were unable to use `deepspeed` on our server and made the corresponding changes in our code. If you want to scale up our method, please check the original VideoMAE for integration with `deepspeed`.
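As a quick sanity check (not part of the original codebase), the following sketch verifies that the dependencies above are importable and that `timm` matches the pinned version:

```python
# Sanity-check sketch for the environment described above: confirms the pinned
# timm version and that the video decoding dependencies can be imported.
import torch
import timm
import decord
import einops

assert timm.__version__ == "0.4.12", f"expected timm 0.4.12, got {timm.__version__}"
print("torch:", torch.__version__)
print("decord:", decord.__version__)
print("einops:", einops.__version__)
print("CUDA available:", torch.cuda.is_available())
```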
Before proceeding, please make sure you have downloaded the checkpoint for LLaMA-7B from LLaMA-v1 (link).
- Downloading checkpoints. We finetune the models from checkpoints pretrained by VideoMAE in a self-supervised, MAE-style way. Follow their instructions to download the checkpoints and put them in `./checkpoints/`.
- Training. Use the scripts in `./scripts/` to run training, e.g., `ssv2_vitb_llama.sh`. If you want to train the models with LLaMA, make sure the `--llama_path` option points to the directory of your LLaMA-7B checkpoint. The directory should contain files such as `checklist.chk`, `consolidated.00.pth`, and `params.json` (see the path-check sketch after this list).
- Evaluation. The training script automatically runs evaluation and reports the results at the end of the logs. To evaluate a checkpoint separately, add `--eval` to the training script and use `--resume` to point to the checkpoint you would like to evaluate.
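Before launching a long run, it can help to confirm that the directory passed to `--llama_path` is complete. The sketch below is not part of the released code; the directory name is a placeholder and the required file names are the ones listed above.

```python
# Sketch: verify that --llama_path points to a complete LLaMA-7B checkpoint
# directory before starting training. The directory name is a placeholder.
from pathlib import Path

llama_path = Path("./llama-7b")  # replace with the value you pass to --llama_path
required = ["checklist.chk", "consolidated.00.pth", "params.json"]

missing = [f for f in required if not (llama_path / f).is_file()]
if missing:
    raise FileNotFoundError(f"{llama_path} is missing: {', '.join(missing)}")
print("LLaMA-7B checkpoint directory looks complete.")
```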
| Model | Checkpoint | Acc1 | Acc5 |
|---|---|---|---|
| ViT-S | [log] / [model] | 64.71 | 89.15 |
| ViT-S-LLaMA | [log] / [model] | 65.88 | 89.93 |
| ViT-B | [log] / [model] | 64.97 | 89.50 |
| ViT-B-LLaMA | [log] / [model] | 66.03 | 90.25 |
The modifications to the video models are quite similar to those for image classification.
- In `llama.py`, we rewrite LLaMA's code by removing the positional embeddings and auto-regressive attention masks.
- The main modeling of ViT-LLaMA is in `vit_llama.py`. The initialization and forward pass are straightforward:
# initialization
...
# Frozen LLaMA blocks: all LLaMA parameters are kept fixed during finetuning.
self.llama = LLaMATransformer(llama_configs)
for param in self.llama.parameters():
    param.requires_grad = False
# Linear projections between the ViT embedding dim and LLaMA's hidden dim (4096).
self.llama_dim_mapper1 = nn.Linear(embed_dim, 4096, bias=False)
self.llama_dim_mapper2 = nn.Linear(4096, embed_dim, bias=False)
...
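For intuition, here is a hedged sketch of how the forward pass could use these modules, assuming `LLaMATransformer` takes token embeddings directly (positional embeddings and causal masks are removed in `llama.py`). The actual forward in `vit_llama.py` may differ; the name `forward_llama` is illustrative.

```python
# Sketch (assumed, not copied from vit_llama.py): route ViT video tokens through
# the frozen LLaMA blocks using the two linear mappers defined in __init__.
def forward_llama(self, x):
    # x: (batch, num_tokens, embed_dim) tokens from the ViT encoder
    x = self.llama_dim_mapper1(x)   # embed_dim -> 4096 (LLaMA hidden size)
    x = self.llama(x)               # frozen LLaMA transformer blocks
    x = self.llama_dim_mapper2(x)   # 4096 -> embed_dim
    return x
```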