Key 'use_ema_weights_to_init_param' is not in struct #126

Closed
L1-M1ng opened this issue Jun 15, 2022 · 8 comments

L1-M1ng commented Jun 15, 2022

Hi! Thanks for your amazing work. I'm trying to run train_vqa_distributed.sh with the checkpoint vqa_base_best.pt downloaded from your URL, but I hit this error. How can I fix it? (I'm using the fairseq repo provided by you.)


L1-M1ng commented Jun 15, 2022

This is the script I used.

#!/usr/bin/env bash

# Guide:
# This script supports distributed training on multi-gpu workers (as well as single-worker training). 
# Please set the options below according to the comments. 
# For multi-GPU worker training, these options should be set manually on each worker. 
# After setting the options, please run the script on each worker.
# To use the shuffled data (if it exists), please uncomment Line 24.

# Number of GPUs per GPU worker
GPUS_PER_NODE=2    # 8 
# Number of GPU workers, for single-worker training, please set to 1
WORKER_CNT=1       # 4 
# The ip address of the rank-0 worker, for single-worker training, please set to localhost
export MASTER_ADDR=127.0.0.1
# The port for communication
export MASTER_PORT=8214
# The rank of this worker, should be in {0, ..., WORKER_CNT-1}, for single-worker training, please set to 0
export RANK=0

data_dir=../../dataset/vqa_data
data=${data_dir}/vqa_train.tsv,${data_dir}/vqa_val.tsv
# Note: If you have shuffled the data in advance, please uncomment the line below.
# data=${data_dir}/vqa_train_1.tsv,${data_dir}/vqa_train_2.tsv,${data_dir}/vqa_train_3.tsv,${data_dir}/vqa_train_4.tsv,${data_dir}/vqa_train_5.tsv,${data_dir}/vqa_train_6.tsv,${data_dir}/vqa_train_7.tsv,${data_dir}/vqa_train_8.tsv,${data_dir}/vqa_train_9.tsv,${data_dir}/vqa_train_10.tsv,${data_dir}/vqa_val.tsv
ans2label_file=../../dataset/vqa_data/trainval_ans2label.pkl
restore_file=../../checkpoints/ofa_base_bast.pt
selected_cols=0,5,2,3,4

log_dir=./vqa_logs
save_dir=./vqa_checkpoints
mkdir -p $log_dir $save_dir

bpe_dir=../../utils/BPE
user_dir=../../ofa_module

task=vqa_gen
arch=ofa_base
criterion=adjust_label_smoothed_cross_entropy
label_smoothing=0.1
batch_size=4
update_freq=4
resnet_drop_path_rate=0.0
encoder_drop_path_rate=0.2
decoder_drop_path_rate=0.2
dropout=0.1
attention_dropout=0.0
max_src_length=80
max_object_length=30
max_tgt_length=30
num_bins=1000
patch_image_size=480

uses_ema="--uses-ema"
store_ema="--store-ema"
ema_fp32="--ema-fp32"
ema_decay=0.9999
ema_start_update=0

# Specify the inference type in validation after each fine-tuning epoch
# As mentioned in the readme, you can choose from allcand or beamsearch evaluation; the default is allcand
val_inference_type=allcand

for total_num_updates in {40000,}; do
  echo "total_num_updates "${total_num_updates}
  for warmup_updates in {1000,}; do
    echo "warmup_updates "${warmup_updates}  
    for lr in {5e-5,}; do
      echo "lr "${lr}
      for patch_image_size in {480,}; do
        echo "patch_image_size "${patch_image_size}

        log_file=${log_dir}/${total_num_updates}"_"${warmup_updates}"_"${lr}"_"${patch_image_size}"_rank"${RANK}".log"
        save_path=${save_dir}/${total_num_updates}"_"${warmup_updates}"_"${lr}"_"${patch_image_size}
        mkdir -p $save_path

        python3 -m torch.distributed.launch --nproc_per_node=${GPUS_PER_NODE} --nnodes=${WORKER_CNT} --node_rank=${RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} ../../train.py \
            ${data} \
            --selected-cols=${selected_cols} \
            --bpe-dir=${bpe_dir} \
            --user-dir=${user_dir} \
            --restore-file=${restore_file} \
            --reset-optimizer --reset-dataloader --reset-meters \
            --save-dir=${save_path} \
            --task=${task} \
            --arch=${arch} \
            --criterion=${criterion} \
            --label-smoothing=${label_smoothing} \
            --batch-size=${batch_size} \
            --update-freq=${update_freq} \
            --encoder-normalize-before \
            --decoder-normalize-before \
            --share-decoder-input-output-embed \
            --share-all-embeddings \
            --layernorm-embedding \
            --patch-layernorm-embedding \
            --code-layernorm-embedding \
            --resnet-drop-path-rate=${resnet_drop_path_rate} \
            --encoder-drop-path-rate=${encoder_drop_path_rate} \
            --decoder-drop-path-rate=${decoder_drop_path_rate} \
            --dropout=${dropout} \
            --attention-dropout=${attention_dropout} \
            --weight-decay=0.01 \
            --optimizer=adam \
            --adam-betas="(0.9,0.999)" \
            --adam-eps=1e-08 \
            --clip-norm=1.0 \
            --lr-scheduler=polynomial_decay \
            --lr=${lr} \
            --total-num-update=${total_num_updates} \
            --warmup-updates=${warmup_updates} \
            --log-format=simple \
            --log-interval=10 \
            --fixed-validation-seed=7 \
            --keep-last-epochs=15 \
            --save-interval=1 --validate-interval=1 \
            --max-update=${total_num_updates} \
            --best-checkpoint-metric=vqa_score --maximize-best-checkpoint-metric \
            --max-src-length=${max_src_length} \
            --max-object-length=${max_object_length} \
            --max-tgt-length=${max_tgt_length} \
            --find-unused-parameters \
            --freeze-encoder-embedding \
            --freeze-decoder-embedding \
            --ans2label-file=${ans2label_file} \
            --valid-batch-size=20 \
            --add-type-embedding \
            --scale-attn \
            --scale-fc \
            --scale-heads \
            --disable-entangle \
            --num-bins=${num_bins} \
            --patch-image-size=${patch_image_size} \
            --prompt-type=prev_output \
            --fp16 \
            --fp16-scale-window=512 \
            --add-object \
            ${uses_ema} \
            ${store_ema} \
            ${ema_fp32} \
            --ema-decay=${ema_decay} \
            --ema-start-update=${ema_start_update} \
            --val-inference-type=${val_inference_type} \
            --num-workers=0 > ${log_file} 2>&1
      done
    done
  done
done


yangapku commented Jun 15, 2022

Hi, in the script you provided, I notice that the restore_file you specified is ofa_base_bast.pt rather than the finetuned vqa_base_best.pt that you mentioned. I want to confirm which one you are using when you see this error. Typically we suggest using the pretrained ofa_base.pt checkpoint for VQA finetuning, which reproduces our reported performance with the released script. The checkpoint vqa_base_best.pt is already finetuned (it has the highest validation score on VQA) and is meant to be used directly for inference and evaluation. Are you trying to perform continued finetuning?


L1-M1ng commented Jun 15, 2022

Hi, thanks for your answer. I downloaded and used ofa_base.pt just now, but I still hit this error.
(screenshots of the error attached)


L1-M1ng commented Jun 15, 2022

I used print(self.cfg.checkpoint) in trainer.py and found that this config does not have the key use_ema_weights_to_init_param.
(screenshot attached)
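
For anyone debugging the same error, the check can also be done without editing trainer.py. This is only a hedged sketch: it assumes the missing key would live on fairseq's CheckpointConfig dataclass (which is what self.cfg.checkpoint holds) and that the OFA-provided fairseq adds the field there.

# Hedged sketch: see whether the installed fairseq defines the field at all.
# If it is missing, Python is importing an upstream fairseq release instead of
# the fairseq provided with the OFA repo.
import dataclasses
from fairseq.dataclass.configs import CheckpointConfig

fields = {f.name for f in dataclasses.fields(CheckpointConfig)}
print("use_ema_weights_to_init_param" in fields)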

yangapku (Member) commented:

Hi, I have just tested this script in my environment and it ran successfully. Please make sure that you pull and use the latest OFA code and ofa_base.pt. After confirming the OFA code is up to date, I would suggest reinstalling fairseq: first run pip uninstall fairseq to remove the old fairseq, and then run pip install -r requirements.txt.
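
If the error persists after reinstalling, it may also help to confirm which fairseq copy Python actually imports (a small check, nothing OFA-specific assumed):

# Print the version and install location of the fairseq that gets imported.
# If __file__ points at an unrelated site-packages install rather than the
# fairseq shipped with the OFA repo, the wrong copy is still being picked up.
import fairseq
print(fairseq.__version__, fairseq.__file__)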


L1-M1ng commented Jun 16, 2022

Thanks a lot! It works!
But I hit a gradient overflow problem when training on the VizWiz dataset, and the loss is nan.
(screenshot of the overflow warnings attached)
This is my trainval_ans2label.pkl file:
(screenshot attached)


L1-M1ng commented Jun 16, 2022

The number of distinct answers in the VizWiz dataset is 48000+, which is huge and makes the number of units in OFAClassificationHead huge as well. Can I keep only the answers that appear frequently for training? If so, how should I modify the code?

yangapku (Member) commented:

Hi, actually the nan issue comes from the preparation of the trainval_ans2label.pkl file. It should be constructed specifically for VizWiz. In detail, every ground-truth answer must be included in this file, otherwise the loss will be nan. This issue has been raised before; please refer to issue #105 for more information.
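
For readers hitting the same nan on VizWiz, below is a rough sketch of building a dataset-specific trainval_ans2label.pkl. The column index and the "conf|!+answer" candidate encoding are assumptions modelled on the released VQA tsvs, not the repo's official recipe; the only hard requirement, per the comment above, is that every ground-truth answer appearing in the finetuning tsvs gets an index.

# Hedged sketch: assign a label id to every answer found in the given tsv files.
# Adjust the answer column index and candidate format to your own data layout.
import csv
import pickle
import sys

csv.field_size_limit(10 ** 9)  # rows may contain large base64 image fields

ans2label = {}
for path in sys.argv[1:]:  # e.g. vizwiz_train.tsv vizwiz_val.tsv (hypothetical names)
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            # Assumed: column 3 holds "conf|!+answer" candidates joined by "&&".
            for cand in row[3].split("&&"):
                ans = cand.split("|!+")[-1].strip()
                if ans and ans not in ans2label:
                    ans2label[ans] = len(ans2label)

with open("trainval_ans2label.pkl", "wb") as f:
    pickle.dump(ans2label, f)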
