
TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning

(Figure: TrafficVLM architecture overview)

Quang Minh Dinh, Minh Khoi Ho, Anh Quan Dang, Hung Phong Tran

This repository contains the official code for the 3rd place solution of Track 2 of the 8th AI City Challenge.

Paper | Slide | Poster | Primary contact: Quang Minh Dinh

Requirements

The codebase is tested on

  • Ubuntu 20.04
  • Python 3.10
  • PyTorch 2.1.0
  • 1 NVIDIA GPU (RTX 3060) with CUDA version 12.1. (Other GPUs are also suitable, and 12GB GPU memory is sufficient to run our code.)

To install requirements:

pip install -r requirements.txt

Setup

All main configurations are stored in config.py. To complete the necessary setup for training:

  • Set _C.GLOB.EXP_PARENT_DIR to the path of your log directory. All experiment logs, metrics, samples, and checkpoints will be stored there.
  • Download the vid2seq_htmchaptersvitt.pth checkpoint from VidChapters (HowTo100M + VidChapters-7M + ViTT) and set _C.MODEL.VID2SEQ_PATH to the path of the downloaded checkpoint. You can also try other Vid2Seq checkpoints. If you want to start directly from the T5 checkpoint, set _C.MODEL.LOAD_VID2SEQ_CKPT to False.
  • All dataset-related paths are stored in dataset/config.py. Follow the instructions in the next section to prepare the data for training.
  • Set _C.SOLVER.LOG_TO_WANDB to False if you don't want to log to Weights & Biases. A sketch of these settings follows this list.
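As a purely illustrative sketch (only keys named above are shown; the paths are placeholders, not the repository's actual defaults), the relevant lines in config.py end up looking something like this:

_C.GLOB.EXP_PARENT_DIR = "/path/to/your/log_dir"                 # experiment logs, metrics, samples, checkpoints
_C.MODEL.VID2SEQ_PATH = "/path/to/vid2seq_htmchaptersvitt.pth"   # downloaded Vid2Seq checkpoint
_C.MODEL.LOAD_VID2SEQ_CKPT = True                                # set to False to start from the T5 checkpoint
_C.SOLVER.LOG_TO_WANDB = False                                   # disable Weights & Biases logging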

Each experiment has a configuration .yml file located in experiments/. You can create your own experiment by adding an {EXP_NAME}.yml file that overrides the default hyperparameters from config.py, in the same manner as experiments/default.yml.
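For example, a minimal custom experiment file might look like the following (a hypothetical sketch; any key omitted here keeps its default from config.py):

# experiments/my_experiment.yml -- hypothetical example
MODEL:
  LOAD_VID2SEQ_CKPT: False   # start directly from the T5 checkpoint
SOLVER:
  LOG_TO_WANDB: False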

Data

Follow the steps outlined in feature_extraction/README.md to extract the CLIP features for training or download the extracted features here.

Change all the sub_global, local_annotated, and global paths for each dataset configuration in dataset/config.py to the generated feature paths. Change all captions paths to the annotation paths.

Checkpoints

You can download the fine-tuned checkpoints for each experiment below:

Experiment Name      | For Ablation | Download          | Description
---------------------|--------------|-------------------|--------------------------------------------
high_fps_wd          | Yes          | Google drive link | Sub-global
local_wd             | Yes          | Google drive link | Sub-global + Local
local_temp_wd        | Yes          | Google drive link | Sub-global + Local + Phase Encoder
global_main_wd       | Yes          | Google drive link | Global
global_main_local_wd | Yes          | Google drive link | Global + Local + Phase Encoder
global_main_sub      | Yes          | Google drive link | Global + Sub-global
global_all_wd        | Yes          | Google drive link | Global + Sub-global + Local + Phase Encoder
high_fps_all         | No           | Google drive link | Sub-global
local_temp_all       | No           | Google drive link | Sub-global + Local + Phase Encoder

Put the downloaded files in the log directory as follows:

{LOG_DIR}
│
├─── high_fps_wd
│       └─ epoch_20.th
├─── local_temp_all
│       ├─ epoch_20.th
│       └─ epoch_30.th
└─── ...

To use a checkpoint for an experiment, either set SOLVER.LOAD_FROM_EPOCH to the checkpoint epoch number or set SOLVER.LOAD_FROM_PATH to the checkpoint path in the experiment .yml file.
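For example (an illustrative snippet assuming the local_temp_all layout above; use one of the two keys, not both):

# experiments/local_temp_all.yml -- illustrative only
SOLVER:
  LOAD_FROM_EPOCH: 30
  # or, alternatively:
  # LOAD_FROM_PATH: {LOG_DIR}/local_temp_all/epoch_30.th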

Training

To run an experiment:

python train.py {EXP_NAME}

where the experiment EXP_NAME has the corresponding configuration file at experiments/{EXP_NAME}.yml.

Training hyperparameters such as the device, batch size, validation interval, and save interval can be modified in the configuration file.

To replicate the results in Table 2 and Table 3 of the paper, fine-tune with the corresponding experiment from the checkpoint table in the previous section. All checkpoints selected for evaluation are provided in that table.
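For example, to fine-tune the Sub-global + Local + Phase Encoder ablation from the table:

python train.py local_temp_wd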

Generate Test Samples

Set SOLVER.LOAD_FROM_EPOCH to the checkpoint epoch number or SOLVER.LOAD_FROM_PATH to the checkpoint path in the experiment .yml file.

To generate captions for the samples from the WTS blind test set:

python generate_test.py {EXP_NAME} -d {DEVICE} -b {BATCH}

where:

  • EXP_NAME: The name of the finished experiment used to generate the test samples.
  • DEVICE: The device to run the generation on. Defaults to cuda.
  • BATCH: The batch size used to generate the samples in batches. Defaults to 1.
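For instance, to generate samples from the finished local_temp_all experiment on the GPU with a batch size of 4 (example values only):

python generate_test.py local_temp_all -d cuda -b 4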

After running the script, the resulting JSON file can be found in {LOG_DIR}/{EXP_NAME}/test_results/.

AI City Challenge 2024 Results

To replicate our results on Track 2 of the AI City Challenge 2024, first download the high_fps_all (epoch 25) and local_temp_all (epochs 20 and 30) checkpoints from the checkpoint section.

For the internal WTS blind test set:

python generate_test.py high_fps_all

For the external WTS blind test set (you might need higher GPU memory to run the ensembling):

python ensemble.py ensemble

Merge the two result JSON files in {LOG_DIR}/high_fps_all/test_results/ and {LOG_DIR}/ensemble/ to get the final result JSON.
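The merge is a plain union of the two caption dictionaries. A minimal sketch, assuming both result files map scenario IDs to caption entries and using placeholder file names (substitute the actual files produced by the two scripts above):

import json

# Placeholder paths -- substitute the actual result files generated above.
with open("high_fps_all_results.json") as f:
    internal = json.load(f)   # internal WTS blind test captions
with open("ensemble_results.json") as f:
    external = json.load(f)   # external WTS blind test captions

# Assumes the two files cover disjoint sets of scenarios.
merged = {**internal, **external}

with open("final_results.json", "w") as f:
    json.dump(merged, f, indent=2)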

Citation

If you find this work useful, consider giving this repository a star and citing our paper as follows:

@InProceedings{Dinh_2024_CVPR,
    author    = {Dinh, Quang Minh and Ho, Minh Khoi and Dang, Anh Quan and Tran, Hung Phong},
    title     = {TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {7134-7143}
}
