2D-MapFormer

Source code for my master's thesis, "2D-MapFormer: 2D-Map Transformer for Audio-Visual Scene-Aware Dialogue and Reasoning" (not yet published).

The source code is derived from:

- AVSD-DSTC10 Baseline: Link
- 2D-Tan module: Link

Usage

Requirements
- conda
- wandb

Environment Setup
```
. ./setup.sh
```

Download I3D and VGGish pretrained features

. ./download_data.sh
python3 utils/combine_files.py # combine feature files into ./data/features/train.pkl and ./data/features/test.pkl
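
As a quick sanity check after combining, the resulting files can be opened and summarized. The following is a minimal sketch, assuming only that train.pkl and test.pkl are standard Python pickles; the internal layout of the features is not documented here, so the helper `summarize_pickle` (a hypothetical name, not part of this repository) reports only the top-level type and length. Only unpickle files you produced yourself.

```python
import pickle
from pathlib import Path


def summarize_pickle(path):
    """Load a pickle file and return (type name, top-level length or None)."""
    with Path(path).open("rb") as f:
        obj = pickle.load(f)
    # Not every object has a length, so fall back to None.
    size = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, size


if __name__ == "__main__":
    # Paths taken from the comment on the combine step above.
    for name in ("train.pkl", "test.pkl"):
        path = Path("data/features") / name
        if path.exists():
            print(name, summarize_pickle(path))
        else:
            print(name, "not found; run utils/combine_files.py first")
```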

Train model

- Specify exp_name in run.sh. The trained model and model outputs will be stored in ./log/{exp_name}/. It is also used as the wandb experiment name.
- Set procedure='train_test'.
- Specify other hyperparameters. See run.sh and main.py for more details.

Run `. ./run.sh`; training and testing then run automatically.

During training, the command line shows progress like the following:

train 15, tan:0.125, dig:2.272: 100%|█████| 4787/4787 [21:15<00:00,  3.75it/s]
train 15, tan:0.112, dig:2.153
val   15, tan:0.087, dig:1.985: 100%|█████| 1117/1117 [06:12<00:00,  3.00it/s]
val   15, tan:0.109, dig:2.295
The best metric was  for 0 epochs.
Expected early stop @ 19
train 16, tan:0.094, dig:2.097: 100%|█████| 4787/4787 [21:10<00:00,  3.77it/s]
train 16, tan:0.112, dig:2.136
val   16, tan:0.088, dig:2.005: 100%|█████| 1117/1117 [06:11<00:00,  3.01it/s]
val   16, tan:0.109, dig:2.298
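
To track the losses across epochs, log lines in the format above can be parsed with a short script. This is a rough sketch, not part of the repository: `parse_summaries` is a hypothetical helper, and it assumes that the lines without a progress bar are the per-epoch summaries, which is inferred from the sample output only.

```python
import re

# Matches e.g. "train 15, tan:0.112, dig:2.153", optionally followed by a progress bar.
LINE_RE = re.compile(r"^(train|val)\s+(\d+), tan:([0-9.]+), dig:([0-9.]+)(.*)$")


def parse_summaries(log_text):
    """Return (phase, epoch, tan, dig) tuples for lines without a progress bar."""
    rows = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # not a train/val line
        phase, epoch, tan, dig, rest = m.groups()
        if rest.strip():
            continue  # trailing text means this is a progress-bar line
        rows.append((phase, int(epoch), float(tan), float(dig)))
    return rows


if __name__ == "__main__":
    sample = (
        "train 15, tan:0.125, dig:2.272: 100%|#####| 4787/4787 [21:15<00:00,  3.75it/s]\n"
        "train 15, tan:0.112, dig:2.153\n"
        "val   15, tan:0.109, dig:2.295\n"
    )
    for row in parse_summaries(sample):
        print(row)
```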

After testing, the command line prints results like the following:

DSTC10_beam_search result:
| Bleu_1: 68.7000
| Bleu_2: 55.5832
| Bleu_3: 45.4938
| Bleu_4: 37.5887
| METEOR: 24.3038
| ROUGE_L: 53.4955
| CIDEr: 86.9928
| IoU-1: 54.7007
| IoU-2: 57.6148
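
If the scores need to be compared across experiments, the printed block above can be turned into a dictionary. The snippet below is a small sketch that assumes every metric line has the form `| name: value`, as in the sample; `parse_metrics` is a hypothetical helper and not part of this repository.

```python
def parse_metrics(text):
    """Parse lines of the form '| name: value' into a {name: float} dict."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the header and any unrelated output
        name, sep, value = line.lstrip("| ").partition(":")
        if not sep:
            continue  # no colon, so not a metric line
        try:
            metrics[name.strip()] = float(value)
        except ValueError:
            continue  # value is not numeric
    return metrics


if __name__ == "__main__":
    sample = "DSTC10_beam_search result:\n| Bleu_4: 37.5887\n| CIDEr: 86.9928\n"
    print(parse_metrics(sample))
```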

Model Architecture


Model Overview


Audio Visual Encoder


Sentence Cross Attention


Update Gate
