2D-MapFormer

Source code for my master's thesis "2D-MapFormer: 2D-Map Transformer for Audio-Visual Scene-Aware Dialogue and Reasoning" (currently unpublished).

The source code is derived from:

  • AVSD-DSTC10 Baseline: Link
  • 2D-Tan module: Link

Usage

  1. Requirements
    • conda
    • wandb
  2. Environment setup
    . ./setup.sh
    
  3. Download the pretrained I3D and VGGish features
    . ./download_data.sh
    python3 utils/combine_files.py # combine feature files into ./data/features/train.pkl and ./data/features/test.pkl
    
  4. Train model
    1. Set exp_name in run.sh. The trained model and its outputs will be stored in ./log/{exp_name}/, and the same name is used as the wandb experiment name.
    2. Set procedure='train_test'.
    3. Specify other hyperparameters. Please see run.sh and main.py for more details.
    4. Run . ./run.sh.
      1. Training and testing run automatically.
      2. Progress like the following is printed to the terminal:
        train 15, tan:0.125, dig:2.272: 100%|█████| 4787/4787 [21:15<00:00,  3.75it/s]
        train 15, tan:0.112, dig:2.153
        val   15, tan:0.087, dig:1.985: 100%|█████| 1117/1117 [06:12<00:00,  3.00it/s]
        val   15, tan:0.109, dig:2.295
        The best metric was  for 0 epochs.
        Expected early stop @ 19
        train 16, tan:0.094, dig:2.097: 100%|█████| 4787/4787 [21:10<00:00,  3.77it/s]
        train 16, tan:0.112, dig:2.136
        val   16, tan:0.088, dig:2.005: 100%|█████| 1117/1117 [06:11<00:00,  3.01it/s]
        val   16, tan:0.109, dig:2.298
        
      3. After testing, results like the following are printed to the terminal:
        DSTC10_beam_search result:
        | Bleu_1: 68.7000
        | Bleu_2: 55.5832
        | Bleu_3: 45.4938
        | Bleu_4: 37.5887
        | METEOR: 24.3038
        | ROUGE_L: 53.4955
        | CIDEr: 86.9928
        | IoU-1: 54.7007
        | IoU-2: 57.6148
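
The console output shown above can be turned into structured values for later comparison between runs. The following is a minimal sketch, assuming the line formats are exactly as printed in this README. The helper names `parse_progress_line` and `parse_results` are hypothetical and are not part of the repository. The README does not define `tan` and `dig`, so they are kept as opaque field names here.

```python
import re

# Hypothetical helpers, not part of the repository. They only parse the
# console output format shown in this README.
PROGRESS_RE = re.compile(
    r"^(?P<split>train|val)\s+(?P<epoch>\d+),"
    r"\s*tan:(?P<tan>[\d.]+),\s*dig:(?P<dig>[\d.]+)"
)
RESULT_RE = re.compile(r"^\|\s*(?P<name>[\w-]+):\s*(?P<value>[\d.]+)\s*$")


def parse_progress_line(line):
    """Return split, epoch, and the two logged values, or None if no match."""
    match = PROGRESS_RE.match(line.strip())
    if match is None:
        return None
    return {
        "split": match["split"],
        "epoch": int(match["epoch"]),
        "tan": float(match["tan"]),
        "dig": float(match["dig"]),
    }


def parse_results(text):
    """Collect '| name: value' lines into a dict mapping name to float."""
    results = {}
    for line in text.splitlines():
        match = RESULT_RE.match(line.strip())
        if match:
            results[match["name"]] = float(match["value"])
    return results


if __name__ == "__main__":
    print(parse_progress_line("train 15, tan:0.112, dig:2.153"))
    print(parse_results("DSTC10_beam_search result:\n| Bleu_1: 68.7000\n"))
```

Lines that carry a trailing progress bar (for example `...dig:2.272: 100%|...`) are matched as well, because the pattern only anchors at the start of the line. Non-metric lines such as `Expected early stop @ 19` return `None`.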
        

Model Architecture

[Figure: Model Overview]
[Figure: Audio Visual Encoder]
[Figure: Sentence Cross Attention]
[Figure: Update Gate]