Key 'use_ema_weights_to_init_param' is not in struct #126

Closed
L1-M1ng opened this issue Jun 15, 2022 · 8 comments

L1-M1ng commented Jun 15, 2022

Hi! Thanks for your amazing work. I'm trying to run train_vqa_distributed.sh with the checkpoint vqa_base_best.pt downloaded from your URL, but I hit this error. How can I fix it? (I'm using the fairseq repo provided by you.)


L1-M1ng commented Jun 15, 2022

This is the script I used.

#!/usr/bin/env bash

# Guide:
# This script supports distributed training on multi-gpu workers (as well as single-worker training). 
# Please set the options below according to the comments. 
# For multi-GPU worker training, these options should be set manually on each worker. 
# After setting the options, please run the script on each worker.
# To use the shuffled data (if it exists), please uncomment Line 24.

# Number of GPUs per GPU worker
GPUS_PER_NODE=2    # 8 
# Number of GPU workers, for single-worker training, please set to 1
WORKER_CNT=1       # 4 
# The ip address of the rank-0 worker, for single-worker training, please set to localhost
export MASTER_ADDR=127.0.0.1
# The port for communication
export MASTER_PORT=8214
# The rank of this worker, should be in {0, ..., WORKER_CNT-1}, for single-worker training, please set to 0
export RANK=0

data_dir=../../dataset/vqa_data
data=${data_dir}/vqa_train.tsv,${data_dir}/vqa_val.tsv
# Note: If you have shuffled the data in advance, please uncomment the line below.
# data=${data_dir}/vqa_train_1.tsv,${data_dir}/vqa_train_2.tsv,${data_dir}/vqa_train_3.tsv,${data_dir}/vqa_train_4.tsv,${data_dir}/vqa_train_5.tsv,${data_dir}/vqa_train_6.tsv,${data_dir}/vqa_train_7.tsv,${data_dir}/vqa_train_8.tsv,${data_dir}/vqa_train_9.tsv,${data_dir}/vqa_train_10.tsv,${data_dir}/vqa_val.tsv
ans2label_file=../../dataset/vqa_data/trainval_ans2label.pkl
restore_file=../../checkpoints/ofa_base_bast.pt
selected_cols=0,5,2,3,4

log_dir=./vqa_logs
save_dir=./vqa_checkpoints
mkdir -p $log_dir $save_dir

bpe_dir=../../utils/BPE
user_dir=../../ofa_module

task=vqa_gen
arch=ofa_base
criterion=adjust_label_smoothed_cross_entropy
label_smoothing=0.1
batch_size=4
update_freq=4
resnet_drop_path_rate=0.0
encoder_drop_path_rate=0.2
decoder_drop_path_rate=0.2
dropout=0.1
attention_dropout=0.0
max_src_length=80
max_object_length=30
max_tgt_length=30
num_bins=1000
patch_image_size=480

uses_ema="--uses-ema"
store_ema="--store-ema"
ema_fp32="--ema-fp32"
ema_decay=0.9999
ema_start_update=0

# Specify the inference type in validation after each fine-tuning epoch
# As mentioned in the readme, you can choose from allcand or beamsearch evaluation; the default is allcand
val_inference_type=allcand

for total_num_updates in {40000,}; do
  echo "total_num_updates "${total_num_updates}
  for warmup_updates in {1000,}; do
    echo "warmup_updates "${warmup_updates}  
    for lr in {5e-5,}; do
      echo "lr "${lr}
      for patch_image_size in {480,}; do
        echo "patch_image_size "${patch_image_size}

        log_file=${log_dir}/${total_num_updates}"_"${warmup_updates}"_"${lr}"_"${patch_image_size}"_rank"${RANK}".log"
        save_path=${save_dir}/${total_num_updates}"_"${warmup_updates}"_"${lr}"_"${patch_image_size}
        mkdir -p $save_path

        python3 -m torch.distributed.launch --nproc_per_node=${GPUS_PER_NODE} --nnodes=${WORKER_CNT} --node_rank=${RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} ../../train.py \
            ${data} \
            --selected-cols=${selected_cols} \
            --bpe-dir=${bpe_dir} \
            --user-dir=${user_dir} \
            --restore-file=${restore_file} \
            --reset-optimizer --reset-dataloader --reset-meters \
            --save-dir=${save_path} \
            --task=${task} \
            --arch=${arch} \
            --criterion=${criterion} \
            --label-smoothing=${label_smoothing} \
            --batch-size=${batch_size} \
            --update-freq=${update_freq} \
            --encoder-normalize-before \
            --decoder-normalize-before \
            --share-decoder-input-output-embed \
            --share-all-embeddings \
            --layernorm-embedding \
            --patch-layernorm-embedding \
            --code-layernorm-embedding \
            --resnet-drop-path-rate=${resnet_drop_path_rate} \
            --encoder-drop-path-rate=${encoder_drop_path_rate} \
            --decoder-drop-path-rate=${decoder_drop_path_rate} \
            --dropout=${dropout} \
            --attention-dropout=${attention_dropout} \
            --weight-decay=0.01 \
            --optimizer=adam \
            --adam-betas="(0.9,0.999)" \
            --adam-eps=1e-08 \
            --clip-norm=1.0 \
            --lr-scheduler=polynomial_decay \
            --lr=${lr} \
            --total-num-update=${total_num_updates} \
            --warmup-updates=${warmup_updates} \
            --log-format=simple \
            --log-interval=10 \
            --fixed-validation-seed=7 \
            --keep-last-epochs=15 \
            --save-interval=1 --validate-interval=1 \
            --max-update=${total_num_updates} \
            --best-checkpoint-metric=vqa_score --maximize-best-checkpoint-metric \
            --max-src-length=${max_src_length} \
            --max-object-length=${max_object_length} \
            --max-tgt-length=${max_tgt_length} \
            --find-unused-parameters \
            --freeze-encoder-embedding \
            --freeze-decoder-embedding \
            --ans2label-file=${ans2label_file} \
            --valid-batch-size=20 \
            --add-type-embedding \
            --scale-attn \
            --scale-fc \
            --scale-heads \
            --disable-entangle \
            --num-bins=${num_bins} \
            --patch-image-size=${patch_image_size} \
            --prompt-type=prev_output \
            --fp16 \
            --fp16-scale-window=512 \
            --add-object \
            ${uses_ema} \
            ${store_ema} \
            ${ema_fp32} \
            --ema-decay=${ema_decay} \
            --ema-start-update=${ema_start_update} \
            --val-inference-type=${val_inference_type} \
            --num-workers=0 > ${log_file} 2>&1
      done
    done
  done
done


yangapku commented Jun 15, 2022

Hi, in the script you provided, I notice that the restore_file you specified is ofa_base_bast.pt rather than the finetuned vqa_base_best.pt that you mentioned. I want to confirm which one you are using when you see this error. Typically we suggest using the pretrained ofa_base.pt checkpoint for VQA finetuning, which reproduces our reported performance with the released script. The checkpoint vqa_base_best.pt is already finetuned (it has the highest validation score on VQA) and is meant to be used directly for inference and evaluation. Are you trying to perform continued finetuning?


L1-M1ng commented Jun 15, 2022

Hi, thanks for your answer. I downloaded and used ofa_base.pt just now, but I still hit this error.
(screenshots of the error attached)


L1-M1ng commented Jun 15, 2022

I used print(self.cfg.checkpoint) in trainer.py and found that this config does not have the key use_ema_weights_to_init_param.
(screenshot attached)
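
For anyone debugging the same error, the check can also be done without editing trainer.py. This is only a hedged sketch: it assumes the missing key would live on fairseq's CheckpointConfig dataclass (which is what self.cfg.checkpoint holds) and that the OFA-provided fairseq adds the field there.

# Hedged sketch: see whether the installed fairseq defines the field at all.
# If it is missing, Python is importing an upstream fairseq release instead of
# the fairseq provided with the OFA repo.
import dataclasses
from fairseq.dataclass.configs import CheckpointConfig

fields = {f.name for f in dataclasses.fields(CheckpointConfig)}
print("use_ema_weights_to_init_param" in fields)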

yangapku (Member) commented:

Hi, I have just tested this script in my environment and it ran successfully. Please make sure that you pull and use the latest OFA code and ofa_base.pt. After confirming the OFA code is up to date, I would suggest reinstalling fairseq: first run pip uninstall fairseq to remove the old fairseq, and then run pip install -r requirements.txt.
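
If the error persists after reinstalling, it may also help to confirm which fairseq copy Python actually imports (a small check, nothing OFA-specific assumed):

# Print the version and install location of the fairseq that gets imported.
# If __file__ points at an unrelated site-packages install rather than the
# fairseq shipped with the OFA repo, the wrong copy is still being picked up.
import fairseq
print(fairseq.__version__, fairseq.__file__)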


L1-M1ng commented Jun 16, 2022

Thanks a lot! It works!
But I hit a gradient overflow problem when training on the VizWiz dataset, and the loss is nan.
(screenshot of the overflow warnings attached)
This is my trainval_ans2label.pkl file:
(screenshot attached)


L1-M1ng commented Jun 16, 2022

The number of distinct answers in the VizWiz dataset is 48000+, which is huge and makes the number of units in OFAClassificationHead huge as well. Can I keep only the answers that appear frequently for training? If so, how should I modify the code?

yangapku (Member) commented:

Hi, actually the nan issue comes from the preparation of the trainval_ans2label.pkl file. It should be constructed specifically for VizWiz. In detail, every ground-truth answer must be included in this file, otherwise the loss will be nan. This issue has been raised before; please refer to issue #105 for more information.
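
For readers hitting the same nan on VizWiz, below is a rough sketch of building a dataset-specific trainval_ans2label.pkl. The column index and the "conf|!+answer" candidate encoding are assumptions modelled on the released VQA tsvs, not the repo's official recipe; the only hard requirement, per the comment above, is that every ground-truth answer appearing in the finetuning tsvs gets an index.

# Hedged sketch: assign a label id to every answer found in the given tsv files.
# Adjust the answer column index and candidate format to your own data layout.
import csv
import pickle
import sys

csv.field_size_limit(10 ** 9)  # rows may contain large base64 image fields

ans2label = {}
for path in sys.argv[1:]:  # e.g. vizwiz_train.tsv vizwiz_val.tsv (hypothetical names)
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            # Assumed: column 3 holds "conf|!+answer" candidates joined by "&&".
            for cand in row[3].split("&&"):
                ans = cand.split("|!+")[-1].strip()
                if ans and ans not in ans2label:
                    ans2label[ans] = len(ans2label)

with open("trainval_ans2label.pkl", "wb") as f:
    pickle.dump(ans2label, f)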
