This is a PyTorch implementation of a two-stream network for action classification on the Kinetics dataset. We train the two streams independently, one on individual (or stacked) RGB frames (appearance) and one on optical flow frames (flow).
The objective of this repository is to establish a two-stream baseline and to ease the training process on such a huge dataset.
- Install PyTorch by selecting your environment on the website and running the appropriate command.
- Install ffmpeg.
- Install cv2 for your Python as well. I recommend Anaconda with Python 3.6 and menpo's opencv3 package.
- Clone this repository.
- Note: We currently support only Python 3+ on Linux.
- We also support Visdom for visualizing loss and accuracy on a subset of the validation set during training.
- To use Visdom in the browser:
  # First install the Python server and client
  pip install visdom
  # Start the server (probably in a screen or tmux)
  python -m visdom.server --port=8097
- Then (during training) navigate to http://localhost:8097/ (see the Training section below for more details).
The Kinetics dataset can be downloaded using Crawler.
Notes:
- Use the latest youtube-dl.
- Some videos might be missing, but you should be all right if you are able to download around 290K videos.
First, extract images from the videos using ffmpeg and resave the annotations so that they are compatible with this code. The scripts in the prep folder of this repo help with both steps.
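As a rough illustration of the frame-extraction step, here is a minimal sketch that builds and runs an ffmpeg command for one video. The function names, the `%05d.jpg` naming pattern, the JPEG quality setting, and the optional `fps` argument are assumptions made for this example; the scripts in the prep folder may use a different layout and different options.

```python
import os
import subprocess


def build_ffmpeg_cmd(video_path, frames_dir, fps=None):
    """Return an ffmpeg argument list that writes numbered JPEG frames.

    Hypothetical helper: the output pattern and quality are assumptions.
    """
    cmd = ["ffmpeg", "-loglevel", "error", "-i", video_path]
    if fps is not None:
        # Resample to a fixed frame rate with ffmpeg's fps video filter.
        cmd += ["-vf", "fps={}".format(fps)]
    # -q:v 2 asks for high-quality JPEG output; frames are named 00001.jpg, ...
    cmd += ["-q:v", "2", os.path.join(frames_dir, "%05d.jpg")]
    return cmd


def extract_frames(video_path, frames_dir, fps=None):
    """Create frames_dir and run ffmpeg; raises CalledProcessError on failure."""
    os.makedirs(frames_dir, exist_ok=True)
    subprocess.run(build_ffmpeg_cmd(video_path, frames_dir, fps), check=True)
```

Calling `extract_frames("clip.mp4", "frames/clip")` would then write one image per frame of the (hypothetical) file `clip.mp4`, provided ffmpeg is on the PATH.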
Next, compute optical flow images using optical-flow. Compute farneback flow, as it is much faster to compute and gives reasonable results. You might want to run multiple processes in parallel.
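One way to run the flow computation in parallel is to split the list of video directories into shards and hand each shard to a worker process. The sketch below shows only that splitting logic; `split_into_shards`, `process_shard`, and `run_parallel` are hypothetical names, and the worker body is a placeholder where the actual flow computation would go.

```python
from multiprocessing import Pool


def split_into_shards(items, num_shards):
    """Deal items round-robin into num_shards lists of near-equal size."""
    return [items[i::num_shards] for i in range(num_shards)]


def process_shard(video_dirs):
    """Placeholder worker (assumption): compute and save flow per directory."""
    for video_dir in video_dirs:
        pass  # call the optical flow tool for video_dir here
    return len(video_dirs)


def run_parallel(video_dirs, num_workers=4):
    """Process all directories with num_workers processes; returns the count."""
    shards = split_into_shards(video_dirs, num_workers)
    with Pool(num_workers) as pool:
        return sum(pool.map(process_shard, shards))
```

Round-robin sharding keeps the shards balanced even when the list is sorted by class, which avoids one worker getting all the long videos of a single category.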
- Download the pretrained weights for InceptionV3 and VGG-16, place them in the directory that will hold the pretrained models, and set global_models_dir in train.py.
- By default, we assume that you have downloaded the dataset.
- To train the network of your choice, specify the parameters listed in train.py as flags or change them manually.
Let's assume that you extracted the dataset into the /home/user/kinetics/ directory. Then your train command from the root directory of this repo is going to be:
CUDA_VISIBLE_DEVICES=0 python train.py --root=/home/user/kinetics/ --global_models_dir=/home/user/pretrained-models/
--visdom=True --input_type=rgb --stepvalues=200000,350000 --max_iterations=500000
To train on flow inputs:
CUDA_VISIBLE_DEVICES=1 python train.py --root=/home/user/kinetics/ --global_models_dir=/home/user/pretrained-models/
--visdom=True --input_type=farneback --stepvalues=250000,400000 --max_iterations=500000
Different parameters in train.py will result in different performance.
- Note:
- InceptionV3 occupies almost 8.5GB of VRAM on a GPU. Training can take 2-4 days depending on disk, CPU, and GPU speed. I used one 1080Ti GPU, a PCIe SSD, and an i7 CPU. Disk operations could be a bottleneck if you are using an HDD.
- For instructions on Visdom usage/installation, see the Installation section. By default it is off.
- If you prefer not to use Visdom, you can always keep track of training using the log file, which is saved under the save_root directory.
- During training, a checkpoint is saved every 25K iterations, and its frame-level top1 & top3 accuracies on a subset of 95k validation images are logged.
- We recommend training for 500K iterations for all input types.
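For readers unfamiliar with the top1 & top3 metrics mentioned in the notes above, the following is a minimal pure-Python sketch of top-k accuracy. The function name and the list-based inputs are assumptions made for illustration; the repository's own code presumably works on tensors instead.

```python
def topk_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k highest scores.

    scores: list of per-class score lists, one per frame.
    labels: list of integer class indices, one per frame.
    """
    hits = 0
    for frame_scores, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(frame_scores)),
                        key=lambda c: frame_scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy example with four frames and four classes.
scores = [
    [0.10, 0.60, 0.20, 0.10],
    [0.40, 0.30, 0.20, 0.10],
    [0.10, 0.20, 0.30, 0.40],
    [0.25, 0.05, 0.60, 0.10],
]
labels = [1, 2, 0, 2]
top1 = topk_accuracy(scores, labels, k=1)  # 2 of 4 frames correct
top3 = topk_accuracy(scores, labels, k=3)  # 3 of 4 frames correct
```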
You can use test.py to generate frame-level scores and save video-level results in a JSON file. Then use eval.py to evaluate the results on the validation set.
Once you have a trained network, you can use test.py to generate frame-level scores. Simply specify the parameters listed in test.py as flags or change them manually, e.g.:
CUDA_VISIBLE_DEVICES=0 python3 test.py --root=/home/user/kinetics/ --input=rgb --test-iteration=500000
- Note:
- By default, it will compute and store frame-level scores, and compute frame-level top1 & top3 accuracies, using the model from the 60K-th iteration.
- A log file is created for frame-level evaluation.
Video-level labeling requires frame-level scores. test.py stores not only frame-level scores but also video-level scores, in the evaluate function within it. It dumps the video-level output for the validation set in JSON format (the same as used in the ActivityNet challenge).
Now you can specify the parameters in eval.py and evaluate.
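To make the step from frame-level to video-level scores concrete, here is a minimal sketch that averages the frame scores of each video and writes the top classes to a JSON structure. Averaging is an assumption; the evaluate function in test.py may aggregate differently. The function name, the `version` argument, and the exact keys are also assumptions, modelled on the ActivityNet-style layout of a `results` dictionary keyed by video id.

```python
import json
from collections import defaultdict


def video_level_results(frame_scores, class_names, version, topk=5):
    """Average per-frame scores per video and keep the topk classes.

    frame_scores: iterable of (video_id, [score per class]) pairs.
    """
    sums = {}
    counts = defaultdict(int)
    for video_id, scores in frame_scores:
        if video_id not in sums:
            sums[video_id] = [0.0] * len(scores)
        sums[video_id] = [s + x for s, x in zip(sums[video_id], scores)]
        counts[video_id] += 1

    results = {}
    for video_id, total in sums.items():
        mean = [s / counts[video_id] for s in total]
        best = sorted(range(len(mean)), key=lambda c: mean[c], reverse=True)[:topk]
        results[video_id] = [{"label": class_names[c], "score": mean[c]} for c in best]
    return {"version": version, "results": results, "external_data": {}}


# Toy example: two frames of one video, one frame of another.
frames = [("vid_a", [0.2, 0.8]), ("vid_a", [0.6, 0.4]), ("vid_b", [0.9, 0.1])]
output = video_level_results(frames, ["jump", "run"], version="v0")
output_json = json.dumps(output, indent=2)
```

In the toy example, `vid_a` ends up labelled `run` because its mean score for that class (0.6) beats `jump` (0.4), even though one of its frames favoured `jump`.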
The table below records the performance of the resnet101 model on the Mini-Kinetics dataset. It is trained for 60K iterations with a learning rate of 0.0005 and a drop by a factor of 10 after 25000, 40000, and 55000 iterations. The batch size used is 64.
| method | frame-top1 | frame-top3 | video-top1 | video-top5 | video-AVG | video-mAP |
| --- | --- | --- | --- | --- | --- | --- |
| Resnet101-RGB | 61.5 | 77.9 | 75.7 | 92.2 | 83.9 | 78.1 |
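The learning-rate schedule used for the model in the table can be sketched as a simple step function. This is an illustration only: the function name is hypothetical, and whether the drop takes effect exactly at the listed iteration or one iteration later is an assumption.

```python
def learning_rate(iteration, base_lr=0.0005,
                  stepvalues=(25000, 40000, 55000), gamma=0.1):
    """Step schedule: multiply base_lr by gamma once per step value reached."""
    drops = sum(1 for step in stepvalues if iteration >= step)
    return base_lr * (gamma ** drops)


# Learning rate before the first drop, after the first, and after the last.
lr_start = learning_rate(0)
lr_after_first = learning_rate(25000)
lr_final = learning_rate(59999)
```

In PyTorch, the same behaviour is available through `torch.optim.lr_scheduler.MultiStepLR` with `milestones` set to the step values and `gamma=0.1`; whether train.py uses that scheduler or its own logic is not stated here.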
Pre-trained models can be downloaded from the links given below. You will need to make changes in test.py to accept the downloaded weights.
- Currently, we provide the following PyTorch models:
  - InceptionV3 trained on Kinetics; available from my Google Drive
    - appearance model trained on rgb-images (named rgb_OneFrame_model_500000)
    - accurate flow model trained on farneback-images (named farneback_OneFrame_model_500000)
- fill the table with fused results
- [1] Kay, Will, et al. "The Kinetics Human Action Video Dataset." arXiv preprint arXiv:1705.06950 (2017).