
How to infer from new videos? #12

Closed
lecidhugo opened this issue Sep 16, 2019 · 5 comments

Comments

@lecidhugo

Hi @LuoweiZhou,
My question is two-fold:
1) I downloaded the pre-trained models and first tried to run the inference example you provided. I got this error:
IOError: [Errno 2] No such file or directory: u'data/anet/rgb_motion_1d/K6Tm5xHkJ5c_resnet.npy'
Below is the point where the error arises:
Loading the model save/anet-unsup-0-0-0-run1/model-best.pth...
Finetune param: ctx2pool_grd.0.weight
Finetune param: ctx2pool_grd.0.bias
Finetune param: vis_embed.0.weight

I verified that the file is indeed missing, but I do not know how to obtain it (I saw a similar issue, but I could not proceed with the provided answer as it was unclear to me). A minimal check for missing feature files is sketched at the end of this comment.
2) I am wondering how I can use your code to run inference on my own videos. Can you please guide me?
Thanks in advance
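
For reference, a minimal check for missing feature files (the feature directory comes from the error message above; the ID list file is hypothetical):

```python
import os

feat_dir = 'data/anet/rgb_motion_1d'   # directory from the error message
id_file = 'my_video_ids.txt'           # hypothetical file, one video ID per line
video_ids = [line.strip() for line in open(id_file) if line.strip()]

# collect the IDs whose per-video feature file is absent
missing = [vid for vid in video_ids
           if not os.path.isfile(os.path.join(feat_dir, vid + '_resnet.npy'))]
print('%d of %d feature files missing' % (len(missing), len(video_ids)))
for vid in missing:
    print('missing:', os.path.join(feat_dir, vid + '_resnet.npy'))
```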

@LuoweiZhou
Contributor

Hi @lecidhugo, I have updated this issue: #5
Basically, you will first need to pre-process your dataset/annotations (e.g., anet). Then, extract frame-wise features (for temporal attention) and region features (for region attention), as described in issue #5. The dataloader needs to be updated accordingly.
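
For illustration, a rough sketch of what such an updated dataloader could serve per video; the directory layout, the `_resnet.npy` suffix, and the array shapes are assumptions, not the exact GVD format:

```python
import os
import numpy as np
import torch
from torch.utils.data import Dataset

class VideoFeatureDataset(Dataset):
    """Minimal example of serving pre-extracted features for inference.
    Directory names and the '_resnet.npy' suffix are assumptions borrowed
    from the anet layout; adapt them to your own feature dumps."""

    def __init__(self, video_ids, frame_feat_dir, region_feat_dir):
        self.video_ids = video_ids
        self.frame_feat_dir = frame_feat_dir    # frame-wise features (temporal attention)
        self.region_feat_dir = region_feat_dir  # region features (region attention)

    def __len__(self):
        return len(self.video_ids)

    def __getitem__(self, idx):
        vid = self.video_ids[idx]
        # frame-wise features: one row per sampled frame (e.g., 2 fps over the whole video)
        frame_feats = np.load(os.path.join(self.frame_feat_dir, vid + '_resnet.npy'))
        # region features: proposals for the frames sampled from each segment
        region_feats = np.load(os.path.join(self.region_feat_dir, vid + '.npy'))
        return vid, torch.from_numpy(frame_feats).float(), torch.from_numpy(region_feats).float()
```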

@lecidhugo
Author

lecidhugo commented Oct 3, 2019

Hi @LuoweiZhou,
Thank you for sharing your code and for your help.
I finally reproduced your steps correctly.
However, I have not figured out how to pre-process my own videos for inference. If my understanding is correct, the pre-processing I have to do is:
1- sample frames from the video
2- compute the features of the sampled frames:
2.1- Region features: can be obtained using extract_features.py and Detectron
2.2- Frame-wise features: I have no idea how to compute them
3- use your code for inference
My goal is to do testing, so I do not need to annotate my videos. Right?
Could you please confirm whether my understanding is correct? Also, how should I do the sampling, and how can I get the frame-wise features?
Thanks in advance,

@LuoweiZhou
Contributor

Hi @lecidhugo, yes, you're right. For the frame-wise features, please refer to this answer. Note that when extracting the region features, we uniformly sample 10 frames from each video segment, while for frame-wise features we sample the entire video at 2 fps. Yes, if your end goal is inference/testing, you do not need any caption annotations.
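
For illustration, a minimal sketch of the two sampling schemes described above, using OpenCV rather than the repo's extraction scripts (function names and the fps fallback are illustrative):

```python
import cv2
import numpy as np

def sample_at_2fps(video_path):
    """Decode the whole video and keep frames at ~2 fps (for frame-wise features)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back to 30 fps if metadata is missing
    step = max(int(round(fps / 2.0)), 1)       # keep every `step`-th frame
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            frames.append(frame)
        i += 1
    cap.release()
    return frames

def sample_10_uniform(video_path, start_sec, end_sec):
    """Uniformly sample 10 frames from one segment (for region features)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    idxs = np.linspace(start_sec * fps, end_sec * fps, num=10).astype(int)
    frames = []
    for idx in idxs:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```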

@lecidhugo
Author

Thank you @LuoweiZhou,
One last question, please: how can I produce segments from a video?
For example, in the following output (which I got from the log), how did you divide the video "v_K6Tm5xHkJ5c" into two segments?
segment v_K6Tm5xHkJ5c_segment_00: A woman is seen sitting in a chair holding a
segment v_K6Tm5xHkJ5c_segment_01: The woman then begins playing the accordion while looking back

@LuoweiZhou
Contributor

LuoweiZhou commented Oct 4, 2019

@lecidhugo The definition of video segments can be found here. You will see the start/end timestamp of each segment in the annotation file. For short videos, you can also directly feed them into the model. GVD captions each video (segment) independently.
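
For illustration, reading the segment timestamps from an ActivityNet-Captions-style annotation file (the path is an assumption; the JSON layout follows the public ActivityNet Captions format):

```python
import json

# Assumed ActivityNet-Captions-style layout:
# { "v_K6Tm5xHkJ5c": { "duration": ..., "timestamps": [[start, end], ...], "sentences": [...] }, ... }
ann_path = 'data/anet/val_1.json'  # hypothetical path
anns = json.load(open(ann_path))

vid = 'v_K6Tm5xHkJ5c'
for i, (start, end) in enumerate(anns[vid]['timestamps']):
    print('%s_segment_%02d: %.2f s -> %.2f s' % (vid, i, start, end))
```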
