
4th PIC Challenge (MTVG & MDVC)


This repo provides data downloads and baseline code for the MTVG and MDVC sub-challenges in the 4th PIC Challenge, held in conjunction with ACM MM 2022.

If you have any questions, please contact us at [email protected].

Dataset Introduction

YouMakeup is a large-scale multimodal instructional video dataset introduced in the paper A Large-Scale Domain-Specific Multimodal Dataset for Fine-Grained Semantic Comprehension (EMNLP 2019).

It contains 2,800 videos from YouTube, spanning more than 420 hours in total. Each video is annotated with a sequence of steps, including temporal boundaries, grounded facial areas and natural language descriptions of each step.


For more details, see the YouMakeup Dataset.

In the PIC Challenge 2022, we use the following data split:

# Total | # Train | # Val | # Test | Video length
--------|---------|-------|--------|-------------
2800    | 1680    | 280   | 840    | 15s – 1h

Data Download

We provide pre-processed features of the raw videos, extracted with C3D and I3D:

makeup_c3d_rgb_stride_1s.zip: Google Drive or Baidu Cloud (password: hcw8)

makeup_i3d_rgb_stride_1s.zip: Google Drive or Baidu Cloud (password: nrlu)

(Optional) You can sign the data license form and send it to [email protected]; we will then provide extracted frames (three frames per second) from the original 2800 videos.

Annotations for the train/val sets are available here or on BaiduNetDisk (password: 31jo). An annotated example from the .json file is as follows:

{
    "video_id": "-2FjMSPITn8", # video id
    "name": "Easy_Foundation_Routine_MakeupShayla--2FjMSPITn8.mp4", # video name
    "duration": 434.1003333333333, # total video length (second)
    "step": # annoated make-up steps
        {
          "1": { # step id
                 "query_idx": "1",  # unique id of (video, step query) for grounding task
                 "area": ["face"],  # involved face regions
                 "caption": "Apply foundation on face with brush", # step caption
                 "startime": "00:01:36", # start time of the step
                 "endtime": "00:02:49" # end time of the step
               },
          ...
         },
}
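For convenience, here is a minimal Python sketch for loading such annotations and converting the HH:MM:SS timestamps to seconds. The file name train_steps.json and the top-level list structure are assumptions; adjust them to the annotation file you downloaded.

```python
import json

def hms_to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string to total seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

# Hypothetical file name; a top-level list of video records is assumed.
with open("train_steps.json") as f:
    videos = json.load(f)

for video in videos:
    for step_id, step in video["step"].items():
        start = hms_to_seconds(step["startime"])  # note: the key is 'startime'
        end = hms_to_seconds(step["endtime"])
        print(video["video_id"], step_id, f"[{start}s, {end}s]", step["caption"])
```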

[Note] The test set will be released on June 10, 2022.

Make-up Temporal Video Grounding Sub-Challenge

Given an untrimmed make-up video and a step query, the Make-up Temporal Video Grounding (MTVG) task aims to localize the target make-up step in the video. This task requires models to align fine-grained video-text semantics and distinguish make-up steps with subtle differences.

theme_mtvg

Evaluation Metric

We adopt "R@n, IoU=m" with n in {1} and m in {0.3, 0.5, 0.7} as the evaluation metric: the percentage of queries for which at least one of the top-n results has an Intersection over Union (IoU) with the ground truth larger than m.
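For illustration only (the official evaluation code is linked below), a minimal sketch of the temporal IoU and R@1 computation could look like this:

```python
def temporal_iou(pred, gt):
    """IoU between two [start, end] segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(predictions, ground_truths, threshold):
    """Fraction of queries whose top-1 prediction exceeds the IoU threshold."""
    hits = sum(temporal_iou(p, g) > threshold
               for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Example: a prediction covering half of a 10s ground-truth step has IoU 0.5.
assert abs(temporal_iou([0.0, 5.0], [0.0, 10.0]) - 0.5) < 1e-9
```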

The evaluation code used by the evaluation server can be found here.

Submission Format

Participants need to submit a candidate timestamp for each (video, text query) input. The results should be stored in results.json, in the following format:

{
    "query_idx": [start_time, end_time],
     ...
}
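A minimal sketch of producing such a file (the query indices and timestamps below are placeholders):

```python
import json

# Placeholder predictions: query_idx -> [start_time, end_time] in seconds.
results = {
    "1": [96.0, 169.0],
    "2": [170.5, 214.0],
}

with open("results.json", "w") as f:
    json.dump(results, f, indent=4)
```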

[Note] The submission site will be opened on June 10, 2022.

Baseline

For this task, we provide the code implementation from Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding. To reproduce the results, please refer to the MTVG folder.

The results of the baseline model on the val set are:

Feature | R@1, IoU=0.3 | R@1, IoU=0.5 | R@1, IoU=0.7 | R@5, IoU=0.3 | R@5, IoU=0.5 | R@5, IoU=0.7
--------|--------------|--------------|--------------|--------------|--------------|-------------
C3D     | 33.47        | 23.05        | 11.78        | 63.28        | 48.88        | 25.07
I3D     | 48.09        | 35.18        | 20.08        | 76.79        | 64.13        | 36.25

Make-up Dense Video Captioning Sub-Challenge

Given an untrimmed make-up video, the Make-up Dense Video Captioning (MDVC) task aims to localize and describe a sequence of make-up steps in the target video. This task requires models to both detect and describe fine-grained make-up events in a video.

theme_mdvc

Evaluation Metric

We measure both the localization and captioning abilities of models. For localization performance, we compute the average precision (AP) across tIoU thresholds of {0.3, 0.5, 0.7, 0.9}. For dense captioning performance, we calculate BLEU4, METEOR, and CIDEr over the matched pairs between generated captions and the ground truth across the same tIoU thresholds.
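For intuition, here is a simplified sketch of selecting matched pairs at a single tIoU threshold; the official evaluation code linked below is authoritative:

```python
def temporal_iou(p, g):
    """tIoU between two [start, end] segments (as in the MTVG sketch above)."""
    inter = max(0.0, min(p[1], g[1]) - max(p[0], g[0]))
    union = (p[1] - p[0]) + (g[1] - g[0]) - inter
    return inter / union if union > 0 else 0.0

def match_pairs(preds, gts, threshold):
    """Greedily pair each predicted segment with the best unused ground-truth
    segment whose tIoU reaches the threshold (simplified illustration)."""
    matched, used = [], set()
    for i, p in enumerate(preds):
        candidates = [(temporal_iou(p, g), j)
                      for j, g in enumerate(gts) if j not in used]
        if candidates:
            iou, j = max(candidates)
            if iou >= threshold:
                matched.append((i, j))  # captions of this pair are then scored
                used.add(j)
    return matched
```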

The evaluation code used by the evaluation server can be found here.

Submission Format

Please use the following JSON format when submitting your results.json for the challenge:

{
    "video_id": [
        {
            "sentence": sent,
            "timestamp": [st_time, end_time]
        },
        ...
    ]
}
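Analogous to the MTVG submission, a minimal sketch of writing this file (the video id is taken from the annotation example above; the sentences and timestamps are placeholders):

```python
import json

# Placeholder predictions for one video; timestamps in seconds.
results = {
    "-2FjMSPITn8": [
        {"sentence": "Apply foundation on face with brush",
         "timestamp": [96.0, 169.0]},
        {"sentence": "Apply concealer under the eyes",
         "timestamp": [170.5, 214.0]},
    ]
}

with open("results.json", "w") as f:
    json.dump(results, f, indent=4)
```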

The submission site will be opened on June 10, 2022.

Baseline

The baseline for this task is from End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021). To reproduce the results, please refer to the MDVC folder.

The results of the baseline model on the val set are:

Model      | Features | Recall | Precision | METEOR | BLEU4 | CIDEr
-----------|----------|--------|-----------|--------|-------|------
PDVC_light | C3D      | 21.16  | 26.41     | 9.44   | 3.80  | 41.22
PDVC_light | I3D      | 23.75  | 31.47     | 12.48  | 6.29  | 68.18

Citation

@inproceedings{chen2020vqabaseline,
  title={YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in Domain-Specific Videos},
  author={Chen, Shizhe and Wang, Weiying and Ruan, Ludan and Yao, Linli and Jin, Qin},
  year={2020}
}
