We propose WT-CRF, a Within-Triplet Transformer-based CRF model, to generate dynamic scene graphs for a given video. WT-CRF computes the unary and temporal potentials of each relationship pair from local-global within-triplet features and combines these potentials with predicted weights in a Conditional Random Field (CRF) framework.
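As an illustration of this weighted combination (a minimal sketch, not the authors' exact implementation; all tensor names and shapes below are assumptions):

```python
import torch
import torch.nn.functional as F

def wtcrf_combine(unary_logits, temporal_logits, clique_weights):
    """Weighted combination of unary and temporal potentials.

    unary_logits, temporal_logits: (num_pairs, num_predicates) potential scores.
    clique_weights: (num_pairs, 2) predicted weights for the unary and temporal cliques.
    Names and shapes are illustrative assumptions only.
    """
    w = torch.softmax(clique_weights, dim=-1)            # normalize the two clique weights
    combined = w[:, :1] * unary_logits + w[:, 1:] * temporal_logits
    return F.softmax(combined, dim=-1)                   # per-pair predicate distribution
```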
We follow the installation instructions from Cong et al.'s STTran repo; a conda setup sketch follows the dependency list below.
- python=3.6
- pytorch=1.1
- scipy=1.1.0
- torchvision=0.3
- cython
- dill
- easydict
- h5py
- opencv
- pandas
- tqdm
- yaml
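A minimal environment-setup sketch (the environment name wtcrf is hypothetical, and exact channels/package names may need adjustment; the list simply mirrors the dependencies above):

```
conda create -n wtcrf python=3.6 pytorch=1.1 torchvision=0.3 scipy=1.1.0 cython dill easydict h5py opencv pandas tqdm pyyaml
conda activate wtcrf
```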
We borrow some compiled code for bbox operations; build the extensions in place:
cd lib/draw_rectangles
python setup.py build_ext --inplace
cd ..
cd fpn/box_intersections_cpu
python setup.py build_ext --inplace
For the object detector, please follow the compilation instructions from https://github.com/jwyang/faster-rcnn.pytorch. We provide a pretrained Faster R-CNN model for Action Genome. Please download it here and put it in:
fasterRCNN/models/faster_rcnn_ag.pth
We use the Action Genome dataset to train and evaluate our method. Please process the downloaded dataset with the Toolkit. The dataset directory should look like:
|-- action_genome
    |-- annotations   # gt annotations
    |-- frames        # sampled frames
    |-- videos        # original videos
In the SGCLS/SGDET experiments, we only keep bounding boxes whose short edge is larger than 16 pixels. Please download the file object_bbox_and_relationship_filtersmall.pkl and put it in the dataloader/ directory.
With a pretrained backbone checkpoint $BACKBONE_MODEL_PATH for each setting (PredCls, SGCls, SGDet), i.e. $mode = predcls, sgcls, or sgdet, precompute the backbone results with features:
- For training samples:
CUDA_VISIBLE_DEVICES=0 python test_backbone_on_training_samples.py -mode $mode -datasize large -data_path dataset/ag/ -backbone_model_path $BACKBONE_MODEL_PATH
For each training video $vid_name in mode $mode, the precomputed results with features are saved to:
'results/' + conf.mode + '_backbone_training/' + vid_name + '.pt'
- For testing samples:
CUDA_VISIBLE_DEVICES=0 python test_backbone_on_testing_samples.py -mode $mode -datasize large -data_path dataset/ag/ -backbone_model_path $BACKBONE_MODEL_PATH
For each testing video $vid_name in mode $mode, the precomputed results with features are saved to:
'results/' + conf.mode + '_backbone_testing/' + vid_name + '.pt'
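To inspect one of these saved files, a minimal sketch (the video name below is hypothetical, and the dictionary keys are not documented here, so we only list them):

```python
import torch

# Hypothetical example path; substitute a real video name and mode.
entry = torch.load('results/predcls_backbone_training/EXAMPLE_VIDEO.mp4.pt', map_location='cpu')
print(type(entry))
if isinstance(entry, dict):
    for key, value in entry.items():
        print(key, getattr(value, 'shape', None))   # print each stored key and its tensor shape, if any
```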
The environment used for local feature generation cannot be used to generate the DINO-based global frame features. Therefore, we precompute the DINO features for each frame and dump them as NumPy binary files. To install DINO_v2, please visit its GitHub page here.
- For all video samples, run the following script to precompute the DINO_v2 frame features used as global context:
CUDA_VISIBLE_DEVICES=0 python extract_dino_features_from_frames.py
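A minimal sketch of what the per-frame extraction does, run in the separate (newer PyTorch) environment noted above. Loading DINO_v2 via torch.hub, the ViT-B/14 variant, and the frame/output paths are all assumptions; the provided script is authoritative:

```python
import os
import numpy as np
import torch
from PIL import Image
from torchvision import transforms

# Load a DINO_v2 backbone via torch.hub; the ViT-B/14 variant is an assumption.
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14').cuda().eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),      # 224 is a multiple of the ViT patch size (14)
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

frame_path = 'dataset/ag/frames/EXAMPLE_VIDEO.mp4/000001.png'   # hypothetical frame path
img = preprocess(Image.open(frame_path).convert('RGB')).unsqueeze(0).cuda()
with torch.no_grad():
    feat = model(img)                # (1, 768) global embedding for the frame

out_dir = 'dino_features/EXAMPLE_VIDEO.mp4'                     # hypothetical output layout
os.makedirs(out_dir, exist_ok=True)
np.save(os.path.join(out_dir, '000001.npy'), feat.cpu().numpy())
```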
- For both training and testing in all three settings (predcls, sgcls, sgdet), run the following script to append the global features to the local ones:
CUDA_VISIBLE_DEVICES=0 python add_dino_features_to_backbone_results.py
This script loads the precomputed local features for each relationship in each video frame, appends the global DINO_v2 features to them, and saves the results to 'results/' + conf.mode + '_backbone_training_with_dino/' (for training) and 'results/' + conf.mode + '_backbone_testing_with_dino/' (for testing).
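A rough sketch of this append step. The key name 'global_feature', the frame-to-feature mapping, and the per-video .npy layout are assumptions; the provided script is authoritative:

```python
import os
import numpy as np
import torch

mode = 'predcls'                                    # or 'sgcls' / 'sgdet'
src_dir = 'results/' + mode + '_backbone_training/'
dst_dir = 'results/' + mode + '_backbone_training_with_dino/'
os.makedirs(dst_dir, exist_ok=True)

for fname in os.listdir(src_dir):
    entry = torch.load(os.path.join(src_dir, fname), map_location='cpu')
    vid_name = fname[:-3]                           # strip the '.pt' suffix
    # Assumed layout: one .npy of per-frame global DINO features per video.
    dino = np.load(os.path.join('dino_features', vid_name + '.npy'))
    entry['global_feature'] = torch.from_numpy(dino)    # assumed key name
    torch.save(entry, os.path.join(dst_dir, fname))
```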
You can train the unary and temporal transformers with train_unary_and_temporal.py; we train for 50 epochs.
- For $mode = {predcls, sgcls, sgdet}:
CUDA_VISIBLE_DEVICES=0 python train_unary_and_temporal.py -mode $mode
We hold out a validation list of 1000 videos from the training set that is not used during training. Run the following validation script to report decomposed performance for each epoch and select the best-performing model for the final evaluation on the test set.
- For $mode = {predcls, sgcls, sgdet}:
CUDA_VISIBLE_DEVICES=0 python val_unary_and_temporal.py -mode $mode
With the best unary and temporal models, we train the weight model, which predicts the weights of the unary and temporal cliques for each relationship:
CUDA_VISIBLE_DEVICES=0 python train_weight.py -unary_model_path $unary_model_path -temporal_model_path $temporal_model_path
We can evaluate WT-CRF with the following commands:
CUDA_VISIBLE_DEVICES=0 python test_unary_and_temporal.py -mode predcls -datasize large -data_path dataset/ag/ -backbone_result_folder results/predcls_backbone_with_dino/ -unary_model_path $unary_model_path -temporal_model_path $temporal_model_path -weight_model_path $weight_model_path
CUDA_VISIBLE_DEVICES=0 python test_unary_and_temporal.py -mode sgcls -datasize large -data_path dataset/ag/ -backbone_result_folder results/sgcls_backbone_with_dino/ -unary_model_path $unary_model_path -temporal_model_path $temporal_model_path -weight_model_path $weight_model_path
CUDA_VISIBLE_DEVICES=0 python test_unary_and_temporal.py -mode sgdet -datasize large -data_path dataset/ag/ -backbone_result_folder results/sgdet_backbone_with_dino/ -unary_model_path $unary_model_path -temporal_model_path $temporal_model_path -weight_model_path $weight_model_path