This repo provides data downloads and baseline code for the MTVG and MDVC sub-challenges in the 4th PIC Challenge, held in conjunction with ACM MM 2022.
If you have any questions, please contact us at [email protected].
YouMakeup is a large-scale multimodal instructional video dataset introduced in the paper YouMakeup: A Large-Scale Domain-Specific Multimodal Dataset for Fine-Grained Semantic Comprehension (EMNLP 2019).
It contains 2,800 videos from YouTube, spanning more than 420 hours in total. Each video is annotated with a sequence of steps, including temporal boundaries, grounded facial areas and natural language descriptions of each step.
For more details, please refer to the YouMakeup Dataset.
In the PIC Challenge 2022, we use the following data split:
# Total | # Train | # Val | # Test | Video Length |
---|---|---|---|---|
2800 | 1680 | 280 | 840 | 15s-1h |
We provide pre-processed features of the raw videos, extracted with C3D and I3D:
makeup_c3d_rgb_stride_1s.zip: Google Drive or Baidu Cloud (password: hcw8)
makeup_i3d_rgb_stride_1s.zip: Google Drive or Baidu Cloud (password: nrlu)
(Optional) You can sign the data license form and send it to [email protected] and we will provide extracted frames (three frames per second) from the original 2800 videos.
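As a minimal sketch of using the features, assuming each archive unpacks to one .npy file per video with one feature row per second of video (an assumption matching the 1-second stride in the file names; the directory layout and feature dimension below are ours, not the official spec):

```python
import numpy as np

# Hypothetical layout: one .npy file per video id,
# shaped (num_seconds, feature_dim) for the 1-second stride features.
feats = np.load("makeup_i3d_rgb_stride_1s/-2FjMSPITn8.npy")
print(feats.shape)  # e.g. (434, 1024) for a ~434-second video
```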
Annotations for the train/val videos are available here or on BaiduNetDisk (password: 31jo). An annotated example in the .json file is as follows:
{
    "video_id": "-2FjMSPITn8",  # video id
    "name": "Easy_Foundation_Routine_MakeupShayla--2FjMSPITn8.mp4",  # video name
    "duration": 434.1003333333333,  # total video length (seconds)
    "step":  # annotated make-up steps
    {
        "1": {  # step id
            "query_idx": "1",  # unique id of (video, step query) for the grounding task
            "area": ["face"],  # involved face regions
            "caption": "Apply foundation on face with brush",  # step caption
            "startime": "00:01:36",  # start time of the step (HH:MM:SS)
            "endtime": "00:02:49"  # end time of the step (HH:MM:SS)
        },
        ...
    },
}
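As a small helper sketch for working with these annotations (the file path, and the assumption that the annotation file holds a list of such video entries, are ours), the HH:MM:SS step boundaries can be converted to seconds like this:

```python
import json

def hms_to_seconds(ts):
    """Convert an 'HH:MM:SS' string such as '00:01:36' to seconds."""
    h, m, s = (int(x) for x in ts.split(":"))
    return h * 3600 + m * 60 + s

# "train.json" is a placeholder path; point it at the downloaded annotation file.
with open("train.json") as f:
    videos = json.load(f)

for video in videos:
    for step_id, step in video["step"].items():
        start = hms_to_seconds(step["startime"])  # note the field name is 'startime'
        end = hms_to_seconds(step["endtime"])
        print(video["video_id"], step_id, start, end, step["caption"])
```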
[Note] The test set will be released on June 10, 2022.
Given an untrimmed make-up video and a step query, the Make-up Temporal Video Grounding (MTVG) task aims to localize the target make-up step in the video. This task requires models to align fine-grained video-text semantics and to distinguish make-up steps with subtle differences.
We adopt "R@n, IoU=m" with n in {1} and m in {0.3, 0.5, 0.7} as the evaluation metrics: the percentage of queries for which at least one of the top-n predicted segments has an Intersection over Union (IoU) with the ground truth larger than m.
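For reference, a minimal sketch of the metric (not the official evaluation script): temporal IoU between two segments, and R@1 over a set of queries.

```python
def temporal_iou(pred, gt):
    """IoU between two [start, end] segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(predictions, ground_truths, m):
    """Percentage of queries whose top-1 prediction has IoU > m with the ground truth."""
    hits = sum(temporal_iou(p, g) > m for p, g in zip(predictions, ground_truths))
    return 100.0 * hits / len(ground_truths)

# Toy example with a single query: the prediction overlaps the ground truth well.
print(recall_at_1([[90.0, 160.0]], [[96.0, 169.0]], m=0.7))  # 100.0
```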
The evaluation code used by the evaluation server can be found here.
Participants need to submit a timestamp candidate for each (video, text query) input. The results should be stored in results.json with the following format:
{
    "query_idx": [start_time, end_time],
    ...
}
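As a minimal sketch of producing such a file (the query ids and timestamps below are made up), predictions can be dumped as:

```python
import json

# Map each query_idx from the annotation file to a single [start, end] candidate (seconds).
predictions = {
    "1": [96.0, 169.0],
    "2": [170.0, 215.5],
}

with open("results.json", "w") as f:
    json.dump(predictions, f, indent=2)
```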
[Note] The submission site will be opened on June 10, 2022.
For this task, we provide a baseline implementation from Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding. To reproduce it, please refer to the MTVG folder.
The results of the baseline model on the val set are:
Feature | R@1,IoU=0.3 | R@1,IoU=0.5 | R@1,IoU=0.7 | R@5,IoU=0.3 | R@5,IoU=0.5 | R@5,IoU=0.7 |
---|---|---|---|---|---|---|
C3D | 33.47 | 23.05 | 11.78 | 63.28 | 48.88 | 25.07 |
I3D | 48.09 | 35.18 | 20.08 | 76.79 | 64.13 | 36.25 |
Given an untrimmed make-up video, the Make-up Dense Video Captioning (MDVC) task aims to localize and describe a sequence of makeup steps in the target video. This task requires models to both detect and describe fine-grained make-up events in a video.
We measure both the localization and captioning abilities of models. For localization performance, we compute the average precision (AP) across tIoU thresholds of {0.3, 0.5, 0.7, 0.9}. For dense captioning performance, we calculate BLEU4, METEOR, and CIDEr over the matched pairs between generated captions and the ground truth across tIoU thresholds of {0.3, 0.5, 0.7, 0.9}.
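As a rough illustration only (the official matching rules live in the evaluation code linked just below), captioning scores are computed per tIoU threshold on temporally matched prediction/ground-truth pairs and then averaged; `caption_score` below stands in for a real metric implementation such as METEOR:

```python
def temporal_iou(a, b):
    """IoU between two [start, end] segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def averaged_caption_score(preds, gts, caption_score, thresholds=(0.3, 0.5, 0.7, 0.9)):
    """Average a captioning metric over matched pairs at several tIoU thresholds."""
    scores = []
    for t in thresholds:
        # Simplified matching: keep every prediction/ground-truth pair whose tIoU >= t.
        pairs = [(p["sentence"], g["sentence"])
                 for p in preds for g in gts
                 if temporal_iou(p["timestamp"], g["timestamp"]) >= t]
        scores.append(caption_score(pairs) if pairs else 0.0)
    return sum(scores) / len(thresholds)
```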
The evaluation code used by the evaluation server can be found here.
Please use the following JSON format when submitting your results.json for the challenge:
{
    "video_id": [
        {
            "sentence": sent,
            "timestamp": [st_time, end_time]
        },
        ...
    ]
}
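A minimal sketch of writing such a file (the video id, captions, and timestamps below are made up):

```python
import json

# One entry per video id; each predicted step carries a caption and a
# [start, end] timestamp in seconds.
results = {
    "-2FjMSPITn8": [
        {"sentence": "apply foundation on face with brush", "timestamp": [96.0, 169.0]},
        {"sentence": "apply concealer under the eyes", "timestamp": [170.0, 215.5]},
    ],
}

with open("results.json", "w") as f:
    json.dump(results, f, indent=2)
```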
The submission site will be opened on June 10, 2022.
We prepared a baseline for this task based on End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021). To reproduce it, please refer to the MDVC folder.
The results of the baseline model on the val set are:
Model | Features | Recall | Precision | METEOR | BLEU4 | CIDEr |
---|---|---|---|---|---|---|
PDVC_light | C3D | 21.16 | 26.41 | 9.44 | 3.80 | 41.22 |
PDVC_light | I3D | 23.75 | 31.47 | 12.48 | 6.29 | 68.18 |
@inproceedings{chen2020vqabaseline,
  title={YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in Domain-Specific Videos},
  author={Chen, Shizhe and Wang, Weiying and Ruan, Ludan and Yao, Linli and Jin, Qin},
  year={2020}
}