This repository includes the code for the paper: Structured Learning for Action Recognition in Videos
The two-stream CNNs we used to extract features are based on this PyTorch implementation (https://github.com/jeffreyhuang1/two-stream-action-recognition.git).
Actions in continuous videos are correlated and may have hierarchical relationships. Densely labeled datasets of complex videos have revealed the simultaneous occurrence of actions, but existing models fail to make use of these relationships to analyze actions in the context of videos and better understand complex videos. We propose a novel architecture consisting of a correlation learning and input synthesis (CoLIS) network, long short-term memory (LSTM), and a hierarchical classifier. First, the CoLIS network captures the correlation between features extracted from video sequences and pre-processes the input to the LSTM. Since the input becomes the weighted sum of multiple correlated features, it enhances the LSTM's ability to learn variable-length long-term temporal dependencies. Second, we design a hierarchical classifier which utilizes the simultaneous occurrence of general actions such as run and jump to refine the prediction of their correlated actions. Third, we use interleaved backpropagation through time for training. All these networks are fully differentiable so that they can be integrated for end-to-end learning. The results show that the proposed approach improves action recognition accuracy by 1.0% and 2.2% on single-labeled and densely labeled datasets, respectively.
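For intuition, the key idea of the CoLIS pre-processing is that each LSTM input becomes a learned weighted sum of correlated frame features. Below is a minimal PyTorch sketch of that attention-style synthesis; the layer sizes and the bilinear weighting scheme are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CorrelatedInputSynthesis(nn.Module):
    """Illustrative sketch: weight each frame feature by its learned
    correlation with the other frames, then feed the weighted sums
    to an LSTM. Not the exact CoLIS architecture from the paper."""
    def __init__(self, feat_dim=4096, hidden_dim=512):
        super().__init__()
        self.score = nn.Linear(feat_dim, feat_dim, bias=False)  # learned correlation
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, feats):                                   # feats: (B, T, 4096)
        # Pairwise frame-to-frame correlation via a learned bilinear form.
        scores = torch.bmm(self.score(feats), feats.transpose(1, 2))  # (B, T, T)
        weights = torch.softmax(scores, dim=-1)
        synthesized = torch.bmm(weights, feats)   # each input is a weighted sum
        out, _ = self.lstm(synthesized)           # (B, T, hidden_dim)
        return out

# Example: a batch of 2 clips, 30 frames each, 4096-dim two-stream features.
model = CorrelatedInputSynthesis()
h = model(torch.randn(2, 30, 4096))
print(h.shape)  # torch.Size([2, 30, 512])
```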
$ git clone https://github.com/yinghanlong/action-recognition-video.git
Download and directly use the output features of UCF101 (from the last layer of ResNet101, dim=4096 per frame) produced by a two-stream CNN.
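The on-disk format of the provided features is not specified here; assuming one array per video of shape (num_frames, 4096), loading them as LSTM input might look like this (the file name and layout are hypothetical):

```python
import numpy as np
import torch

# Hypothetical layout: one .npy file per video, shape (num_frames, 4096).
# Adapt to the actual format of the downloaded features.
feats = np.load("features/v_ApplyEyeMakeup_g01_c01.npy")
clip = torch.from_numpy(feats).float().unsqueeze(0)  # (1, T, 4096) batch
```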
- We extract RGB frames from each video in the UCF101 dataset with a sampling rate of 10 and save them as .jpg images on disk, which takes about 5.9 GB (see the sketch below).
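A minimal OpenCV sketch of this step, assuming "sampling rate of 10" means keeping every 10th frame (the paths are placeholders):

```python
import os
import cv2

def extract_frames(video_path, out_dir, sample_rate=10):
    """Save every `sample_rate`-th frame of a video as a .jpg image."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_rate == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame{saved:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()

extract_frames("UCF101/v_ApplyEyeMakeup_g01_c01.avi",
               "jpegs_256/v_ApplyEyeMakeup_g01_c01")
```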
For the motion stream, we use one of two methods to obtain optical flow data (see the sketch after this list).
- Download the preprocessed TV-L1 optical flow dataset directly from https://github.com/feichtenhofer/twostreamfusion.
- Use the FlowNet 2.0 method to generate two-channel optical flow images and save the x and y channels as separate .jpg images on disk, which takes about 56 GB.
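For reference, TV-L1 flow like that in the preprocessed dataset can also be computed locally with OpenCV's contrib module. This is a sketch only; the ±20 clipping bound and the quantization to .jpg are common-practice assumptions, not necessarily this repo's exact preprocessing:

```python
import cv2
import numpy as np

# Requires opencv-contrib-python for the TV-L1 implementation.
tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()

prev = cv2.imread("frame000000.jpg", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame000001.jpg", cv2.IMREAD_GRAYSCALE)
flow = tvl1.calc(prev, curr, None)               # (H, W, 2) float32

# Quantize each channel to [0, 255] and save x/y as separate .jpg images,
# mirroring the two-file-per-pair layout described above.
for c, name in enumerate(("x", "y")):
    ch = np.clip(flow[..., c], -20, 20)          # clip bound is an assumption
    ch = ((ch + 20) / 40 * 255).astype(np.uint8)
    cv2.imwrite(f"flow_{name}_000001.jpg", ch)
```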
(Alternative) Download the preprocessed data directly from feichtenhofer/twostreamfusion:
- RGB images
wget http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/data/ucf101_jpegs_256.zip.001
wget http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/data/ucf101_jpegs_256.zip.002
wget http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/data/ucf101_jpegs_256.zip.003
cat ucf101_jpegs_256.zip* > ucf101_jpegs_256.zip
unzip ucf101_jpegs_256.zip
- Optical Flow
wget http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/data/ucf101_tvl1_flow.zip.001
wget http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/data/ucf101_tvl1_flow.zip.002
wget http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/data/ucf101_tvl1_flow.zip.003
cat ucf101_tvl1_flow.zip* > ucf101_tvl1_flow.zip
unzip ucf101_tvl1_flow.zip
After setting up the environment, you can extract features with a pretrained two-stream CNN or directly use the features we provide. Then you can train the proposed model (attention-enhanced LSTM) using the features as inputs. To train with the UCF101 dataset:
$ python lstm-cor-2.py --resume PATH-TO-MODEL --epoches=50 --lr=5e-4 --top5enhance
To train with MultiTHUMOS:
$ python lstm-multithmos.py --resume PATH-TO-MODEL --epoches=50 --lr=5e-4 --top5enhance --dataset=multithumos
If you want to use the vanilla LSTM, do not set --top5enhance, or use lstm-ori.py.
Please contact us at [email protected] if you encounter any problems using this repository.
Network | UCF101 Top-1 Accuracy | MultiTHUMOS mAP |
---|---|---|
Spatial CNN | 82.1% | - |
Motion CNN | 79.4% | - |
Two-stream CNN | 88.5% | 27.6% |
Two-stream CNN + LSTM | 89.8% | 29.6% |
Our work | 90.8% | 31.8% |
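Note that MultiTHUMOS is densely (multi-)labeled, so results on it are reported as mean average precision. A minimal sketch of how mAP can be computed with scikit-learn (this repo's evaluation code may differ in details):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(labels, scores):
    """labels: (N, C) binary ground truth; scores: (N, C) model outputs.
    mAP averages per-class average precision over classes with positives."""
    aps = [average_precision_score(labels[:, c], scores[:, c])
           for c in range(labels.shape[1])
           if labels[:, c].any()]  # skip classes with no positive samples
    return float(np.mean(aps))

labels = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
scores = np.array([[0.9, 0.2, 0.8], [0.1, 0.7, 0.6], [0.8, 0.4, 0.3]])
print(mean_average_precision(labels, scores))
```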
- For the spatial CNN, please modify this path and this function to fit the UCF101/MultiTHUMOS dataset on your device.
- Training and testing
python spatial_cnn.py --resume PATH_TO_PRETRAINED_MODEL
- Only testing
python spatial_cnn.py --resume PATH_TO_PRETRAINED_MODEL --evaluate
- For the motion CNN, please modify this path and this function to fit the UCF101/MultiTHUMOS dataset on your device.
- Training and testing
python motion_cnn.py --resume PATH_TO_PRETRAINED_MODEL
- Only testing
python motion_cnn.py --resume PATH_TO_PRETRAINED_MODEL --evaluate
If you use this repo for your work, please use the following citation:
@ARTICLE{8805090,
author={Long, Yinghan and Srinivasan, Gopalakrishnan and Panda, Priyadarshini and Roy, Kaushik},
journal={IEEE Journal on Emerging and Selected Topics in Circuits and Systems},
title={Structured Learning for Action Recognition in Videos},
year={2019},
volume={9},
number={3},
pages={475-484},
doi={10.1109/JETCAS.2019.2935004}}