This is the code for the NeurIPS 2018 paper VideoCapsuleNet: A Simplified Network for Action Detection.
The paper can be found here: http://papers.nips.cc/paper/7988-videocapsulenet-a-simplified-network-for-action-detection
The network is implemented using TensorFlow 1.4.1.
Python packages used: numpy, scipy, scikit-video
- caps_layers.py: Contains the functions required to construct capsule layers (primary, convolutional, and fully-connected).
- caps_network.py: Contains the VideoCapsuleNet model.
- caps_main.py: Contains the main function, which is called to train the network.
- config.py: Contains several different hyperparameters used for the network, training, or inference.
- get_iou.py: Contains the function used to evaluate the network.
- inference.py: Contains the inference code.
- load_ucf101_data.py: Contains the data-generator for UCF-101.
- output2.txt: A sample output file for training and testing.
We have supplied the code for training and testing the model on the UCF-101 dataset. The file load_ucf101_data.py creates two DataLoaders: one for training and one for testing. The dataset_dir variable at the top of the file should be set to the base directory which contains the frames and annotations.
To run this code, you need to do the following:
- Download the UCF-101 dataset at http://crcv.ucf.edu/data/UCF101.php
- Extract the frames from each video (downsized to 160x120) and store them as JPEG files named "frame_K.jpg", where K is the frame number from 0 to T-1 (see the sketch after this list). The path to the frames should be:
[dataset_dir]/UCF101_Frames/[Video Name]/frame_K.jpg
- Download the trainAnnot.mat and testAnnot.mat annotations from https://github.com/gurkirt/corrected-UCF101-Annots. The path to the annotations should be:
[dataset_dir]/UCF101_Annotations/*.mat
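The frame layout above can be produced with a short script. The following is only a minimal sketch, not part of the repository: it assumes the raw .avi files sit under a UCF-101 subdirectory of dataset_dir, and it uses scikit-video together with the older scipy.misc image helpers that match the TF 1.4-era environment (on newer scipy versions, substitute Pillow for imresize/imsave).

```python
# Sketch only: dump UCF-101 frames into the layout expected by load_ucf101_data.py.
import os
import glob
import skvideo.io
import scipy.misc

dataset_dir = '/path/to/UCF101'                          # same value as in load_ucf101_data.py
video_dir = os.path.join(dataset_dir, 'UCF-101')         # assumed location of the raw .avi files
frame_dir = os.path.join(dataset_dir, 'UCF101_Frames')   # layout required by the data loader

for video_path in glob.glob(os.path.join(video_dir, '*', '*.avi')):
    video_name = os.path.splitext(os.path.basename(video_path))[0]
    out_dir = os.path.join(frame_dir, video_name)
    os.makedirs(out_dir, exist_ok=True)

    frames = skvideo.io.vread(video_path)                # (T, H, W, 3) uint8 array
    for k, frame in enumerate(frames):
        # Downsize to 160x120 (width x height) and save as frame_K.jpg, K = 0..T-1.
        frame = scipy.misc.imresize(frame, (120, 160))
        scipy.misc.imsave(os.path.join(out_dir, 'frame_%d.jpg' % k), frame)
```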
Once the data is set up, you can train (and test) the network by calling python3 caps_main.py.
To get results similar to those in the paper, the pretrained C3D weights are needed in the pretrained_weights folder (see readme.txt).
The config.py file contains several hyper-parameters which are useful for training the network.
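As a rough, hypothetical illustration of the kind of settings you might tune there (the actual variable names and defaults are those defined in config.py and may differ):

```python
# Hypothetical values only -- consult config.py in the repository for the real
# names, defaults, and any additional options.
batch_size = 8        # video clips per training batch
learning_rate = 1e-4  # optimizer step size
n_epochs = 20         # number of training epochs
n_frames = 8          # frames per input clip
```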
During training and testing, metrics are printed to stdout as well as to an output*.txt file. During training/validation, the losses and accuracies are printed out. At test time, the accuracy, the f-mAP and v-mAP scores (at several IoU thresholds), and the per-class f-AP@IoU=0.5 and v-AP@IoU=0.5 are printed out.
An example of this is found in output2.txt. These are not the same results as those reported in the paper (cleaning up the code changed variable names, so the original weights could not be transferred directly), but they are comparable.
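For reference, the IoU underlying these scores is the standard intersection-over-union between the predicted and ground-truth localisations; get_iou.py is the authoritative implementation, and the sketch below only illustrates the idea for binary masks.

```python
import numpy as np

def frame_iou(pred_mask, gt_mask):
    """IoU between two binary masks for a single frame of shape (H, W)."""
    pred_mask, gt_mask = pred_mask.astype(bool), gt_mask.astype(bool)
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return intersection / float(union) if union > 0 else 0.0

def video_iou(pred_masks, gt_masks):
    """Volumetric IoU over a whole clip: masks have shape (T, H, W)."""
    intersection = np.logical_and(pred_masks, gt_masks).sum()
    union = np.logical_or(pred_masks, gt_masks).sum()
    return intersection / float(union) if union > 0 else 0.0

# A detection at threshold t counts as correct when, e.g., video_iou(pred, gt) >= t;
# the AP/mAP numbers aggregate these per-frame and per-video decisions.
```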
As the network is trained, the best weights are saved to the network_saves folder. The weights for the network trained on UCF-101 can be found here. Unzip the file and place the three .ckpt files in the network_saves folder. These weights correspond to the results found in output2.txt.
If you just want to test the model using the weights above, uncomment #iou() at the bottom of the get_iou.py file and run python3 get_iou.py.
If you just want to obtain the segmentation for a single video, you can use inference.py. An example video from UCF-101 is given. Running inference.py saves the cropped video (first resized to HxW=120x160, then cropped to HxW=112x112) as well as the segmented video, as cropped_vid.avi and segmented_vid.avi respectively.
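The resize-then-crop preprocessing mentioned above can be reproduced roughly as follows. This is only a sketch of the idea, not the exact code in inference.py: it assumes a centre crop, uses the example file name 'example_vid.avi' as a placeholder, and relies on the older scipy.misc.imresize (substitute Pillow on newer scipy).

```python
import numpy as np
import skvideo.io
import scipy.misc

def load_and_crop(video_path):
    """Read a video, resize frames to 120x160, then centre-crop to 112x112."""
    frames = skvideo.io.vread(video_path)                            # (T, H, W, 3)
    resized = np.stack([scipy.misc.imresize(f, (120, 160)) for f in frames])
    top = (120 - 112) // 2                                           # 4 pixels off top/bottom
    left = (160 - 112) // 2                                          # 24 pixels off left/right
    return resized[:, top:top + 112, left:left + 112, :]

clip = load_and_crop('example_vid.avi')                              # placeholder file name
print(clip.shape)                                                    # (T, 112, 112, 3)
```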