TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning

Quang Minh Dinh, Minh Khoi Ho, Anh Quan Dang, Hung Phong Tran

This repository contains the official code for the 3rd place solution of the 8th AI City Challenge Track 2.

Paper | Slide | Poster | Primary contact: Quang Minh Dinh

Requirements

The codebase is tested on:

  • Ubuntu 20.04
  • Python 3.10
  • PyTorch 2.1.0
  • 1 NVIDIA GPU (RTX 3060) with CUDA 12.1. (Other GPUs are also suitable; 12 GB of GPU memory is sufficient to run our code.)

To install requirements:

pip install -r requirements.txt

Setup

All main configurations are stored in config.py. To complete the setup needed for training:

  • Change _C.GLOB.EXP_PARENT_DIR to the path to your log directory. All the experimental logs, metrics, samples, and checkpoints will be stored here.
  • Download the vid2seq_htmchaptersvitt.pth checkpoint from VidChapters (HowTo100M + VidChapters-7M + ViTT). Set _C.MODEL.VID2SEQ_PATH to the path of the downloaded checkpoint. You can also try other Vid2Seq checkpoints. To start directly from the T5 checkpoint instead, set _C.MODEL.LOAD_VID2SEQ_CKPT to False.
  • All the dataset-related paths are stored in dataset/config.py. Follow the instructions in the next section to prepare the data for training.
  • Set _C.SOLVER.LOG_TO_WANDB to False if you don't want to log to Wandb.

Each experiment has a .yml configuration file located in experiments/. You can create your own experiment by adding a {EXP_NAME}.yml file that overrides the default hyperparameters in config.py, in the same manner as experiments/default.yml.
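As a rough illustration of the override mechanism described above, the sketch below merges a set of experiment overrides into a set of defaults. It is not the repository's implementation: apply_overrides is a hypothetical helper, the plain dictionaries stand in for config.py and a parsed experiment file, and the values shown are placeholders. Only the key names (GLOB.EXP_PARENT_DIR, MODEL.LOAD_VID2SEQ_CKPT, SOLVER.LOG_TO_WANDB) come from this README.

```python
from copy import deepcopy


def apply_overrides(defaults: dict, overrides: dict) -> dict:
    """Return a copy of `defaults` with `overrides` merged in recursively."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if key not in merged:
            # Reject unknown keys so a typo in an experiment file fails loudly.
            raise KeyError(f"unknown config key: {key}")
        if isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical defaults standing in for config.py; values are placeholders.
defaults = {
    "GLOB": {"EXP_PARENT_DIR": "/path/to/logs"},
    "MODEL": {"LOAD_VID2SEQ_CKPT": True},
    "SOLVER": {"LOG_TO_WANDB": True},
}

# What a hypothetical experiments/my_exp.yml might contain once parsed.
overrides = {"SOLVER": {"LOG_TO_WANDB": False}}

config = apply_overrides(defaults, overrides)
print(config["SOLVER"]["LOG_TO_WANDB"])      # False (overridden)
print(config["MODEL"]["LOAD_VID2SEQ_CKPT"])  # True (default kept)
```

Only the keys listed in the experiment file change; everything else keeps its default, which is why an experiment file can stay short.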

Data

Follow the steps outlined in feature_extraction/README.md to extract the CLIP features for training or download the extracted features here.

Change all the sub_global, local_annotated, and global paths for each dataset configuration in dataset/config.py to the generated feature paths. Change all captions paths to the annotation paths.

Checkpoints

The fine-tuned checkpoints for each experiment can be downloaded below:

| Experiment Name | For Ablation | Download | Description |
|---|---|---|---|
| high_fps_wd | Yes | Google drive link | Sub-global |
| local_wd | Yes | Google drive link | Sub-global + Local |
| local_temp_wd | Yes | Google drive link | Sub-global + Local + Phase Encoder |
| global_main_wd | Yes | Google drive link | Global |
| global_main_local_wd | Yes | Google drive link | Global + Local + Phase Encoder |
| global_main_sub | Yes | Google drive link | Global + Sub-global |
| global_all_wd | Yes | Google drive link | Global + Sub-global + Local + Phase Encoder |
| high_fps_all | No | Google drive link | Sub-global |
| local_temp_all | No | Google drive link | Sub-global + Local + Phase Encoder |

Place the downloaded files in the log directory as follows:

{LOG_DIR}
│
├─── high_fps_wd
│       └─ epoch_20.th
├─── local_temp_all
│       ├─ epoch_20.th
│       └─ epoch_30.th
└─── ...

To use a checkpoint for an experiment, either set SOLVER.LOAD_FROM_EPOCH to the checkpoint epoch number or set SOLVER.LOAD_FROM_PATH to the checkpoint path in the experiment .yml file.
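The relationship between the two settings and the directory layout above can be sketched as follows. resolve_checkpoint is a hypothetical helper, not a function from this repository, and the rule that an explicit path wins over an epoch number is an assumption; the actual precedence is determined by the training code.

```python
from pathlib import Path
from typing import Optional


def resolve_checkpoint(
    log_dir: str,
    exp_name: str,
    load_from_epoch: Optional[int] = None,
    load_from_path: Optional[str] = None,
) -> Optional[Path]:
    """Map the two SOLVER settings to a checkpoint file path."""
    # Assumption: an explicit path takes precedence over an epoch number.
    if load_from_path:
        return Path(load_from_path)
    if load_from_epoch is not None:
        # Follows the {LOG_DIR}/{EXP_NAME}/epoch_{N}.th layout shown above.
        return Path(log_dir) / exp_name / f"epoch_{load_from_epoch}.th"
    # Neither setting given: no checkpoint is loaded.
    return None


print(resolve_checkpoint("logs", "local_temp_all", load_from_epoch=30).as_posix())
# logs/local_temp_all/epoch_30.th
```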

Training

To run an experiment:

python train.py {EXP_NAME}

where the experiment EXP_NAME has the corresponding configuration file at experiments/{EXP_NAME}.yml.

Training hyperparameters such as device, batch size, validation interval, and save interval can be modified in the configuration file.

To replicate the results in Table 2 and Table 3 of the paper, fine-tune with the corresponding experiment from the checkpoint table in the previous section. All the checkpoints selected for evaluation are provided in that table.

Generate Test Samples

Set SOLVER.LOAD_FROM_EPOCH to the checkpoint epoch number or SOLVER.LOAD_FROM_PATH to the checkpoint path in the experiment .yml file.

To generate captions for the samples from the WTS blind test set:

python generate_test.py {EXP_NAME} -d {DEVICE} -b {BATCH}

where:

  • EXP_NAME: The name of the finished experiment used to generate the test samples.
  • DEVICE: The device on which to run generation. Default is cuda.
  • BATCH: The batch size used for generation. Default is 1.

After running the script, the result JSON can be found in {LOG_DIR}/{EXP_NAME}/test_results/.
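If you need to generate samples for several experiments in a row, the command above can be assembled programmatically. The sketch below only builds and prints the command lines; build_generate_command is a hypothetical helper, and the batch size of 4 is an arbitrary example value.

```python
import shlex


def build_generate_command(exp_name: str, device: str = "cuda", batch: int = 1) -> list:
    """Build the argument list for generate_test.py as documented above."""
    return ["python", "generate_test.py", exp_name, "-d", device, "-b", str(batch)]


for name in ["high_fps_all", "local_temp_all"]:
    cmd = build_generate_command(name, batch=4)
    # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```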

AI City Challenge 2024 Results

To replicate the results on Track 2 of the AI City Challenge 2024, first download the high_fps_all (epoch 25) and local_temp_all (epochs 20 and 30) checkpoints from the checkpoint section.

For the internal WTS blind test set:

python generate_test.py high_fps_all

For the external WTS blind test set (you might need more GPU memory to run the ensembling):

python ensemble.py ensemble

Merge the two result JSON files in {LOG_DIR}/high_fps_all/test_results/ and {LOG_DIR}/ensemble/ to get the final result JSON.
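A minimal sketch of the merge step is shown below. It assumes that each result file holds a single JSON object keyed by sample ID and that the internal and external sets do not share any IDs; neither assumption is confirmed by this README, so check your result files first. The function names and the sample IDs in the example are hypothetical, and the file paths are left as arguments because the result file names are not specified here.

```python
import json
from pathlib import Path


def merge_results(first: dict, second: dict) -> dict:
    """Combine two result objects, refusing to silently overwrite entries."""
    overlap = first.keys() & second.keys()
    if overlap:
        raise ValueError(
            f"{len(overlap)} keys appear in both results, e.g. {sorted(overlap)[:3]}"
        )
    return {**first, **second}


def merge_result_files(first_path: str, second_path: str, out_path: str) -> None:
    """Read two result JSON files and write the merged object to out_path."""
    first = json.loads(Path(first_path).read_text())
    second = json.loads(Path(second_path).read_text())
    Path(out_path).write_text(json.dumps(merge_results(first, second), indent=2))


# In-memory example with made-up sample IDs.
internal = {"sample_a": ["caption 1"]}
external = {"sample_b": ["caption 2"]}
print(sorted(merge_results(internal, external)))  # ['sample_a', 'sample_b']
```

Raising on overlapping keys is a deliberate choice in this sketch: if the same ID shows up in both files, it is safer to stop and inspect than to let one result overwrite the other.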

Citation

If you find this work useful, consider giving this repository a star and citing our paper as follows:

@InProceedings{Dinh_2024_CVPR,
    author    = {Dinh, Quang Minh and Ho, Minh Khoi and Dang, Anh Quan and Tran, Hung Phong},
    title     = {TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {7134-7143}
}