Problems with VQA finetuning #59

Closed
markovivl opened this issue Mar 27, 2022 · 9 comments

@markovivl

Hello! I am trying to finetune OFA-large on VQA using the Visual Genome dataset, following the finetuning instructions in the repo. Unfortunately, I have encountered a bug that I have some difficulty identifying. I preprocessed the data exactly as in the example, but during training my gradients overflow and the model does not train.

slice_id 0 seek offset 0
2022-03-28 02:29:07 - trainer.py[line:703] - INFO: begin training epoch 1
2022-03-28 02:29:07 - train.py[line:296] - INFO: Start iterating over samples
2022-03-28 02:29:09 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 64.0
2022-03-28 02:29:11 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 32.0
2022-03-28 02:29:14 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 16.0
2022-03-28 02:29:15 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 8.0
2022-03-28 02:29:17 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 4.0
2022-03-28 02:29:19 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 2.0
2022-03-28 02:29:22 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 1.0
2022-03-28 02:29:23 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.5
2022-03-28 02:29:26 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
2022-03-28 02:29:28 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.125
2022-03-28 02:29:28 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0625

I narrowed the issue down to the answers column. If I replace this column in my dataset with the column from the dataset provided in the repo, everything works fine. However, if I change the answers in the column, or even modify them in any way, I get the same issue. I suspected that my procedure for changing the column could be the problem, but if I "modify" the column with an empty string, it still works. Any other symbol added to the column again results in an overflow. I also tried modifying single elements rather than the whole column, and found that changing certain answers does not lead to an overflow, while changing others does. I was unable to narrow the issue down further or find any pattern in it.

I train on a single server with 1 GPU.

@yangapku yangapku self-assigned this Mar 28, 2022
@yangapku
Member

Hi Markov, for your custom answer candidate set, please also prepare a custom trainval_ans2label.pkl file (a pickled python dict mapping answer text to label id) to replace the provided one. This file is used in training & inference to constrain the output space from the full vocabulary to only the answer candidate set. It must be consistent with your dataset, otherwise the overflow problem will arise during training whenever an answer unseen in trainval_ans2label.pkl is encountered in your dataset.
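As a concrete sketch of preparing such a file (the candidate answers below are hypothetical placeholders; substitute the actual answer set of your dataset):

```python
import pickle

# Hypothetical candidate answers collected from a custom dataset's
# answers column; replace with the real answer set used for fine-tuning.
candidate_answers = ["yes", "no", "2", "red", "skateboarding"]

# Map each answer text to a unique label id, contiguous from 0.
ans2label = {ans: idx for idx, ans in enumerate(candidate_answers)}

# Pickle the dict so it can stand in for the provided trainval_ans2label.pkl.
with open("trainval_ans2label.pkl", "wb") as f:
    pickle.dump(ans2label, f)
```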

@markovivl
Author

markovivl commented Mar 28, 2022

Doesn't it cripple the zero shot capability?

@yangapku
Member

yangapku commented Mar 28, 2022

Hi, since we utilized various sources of VQA samples during pretraining, for zero-shot (open-domain) VQA we turn directly to the pretrained OFA-Large, which does not set this constraint. For more details on zero-shot VQA inference, please refer to the open-domain VQA Colab (url). The VQA fine-tuning process specifically targets the VQAv2 challenge, whose answers are restricted to the 3,129-answer candidate set. To achieve higher accuracy on this specific challenge, we use trie-based constrained training & inference driven by the trainval_ans2label.pkl file.

@phanxuanphucnd

Hi @yangapku, could you please provide an example of trainval_ans2label.pkl? I don't understand what a pair of answer-text to label-id is.

Thanks

@yangapku
Member

@phanxuanphucnd You can refer to the trainval_ans2label.pkl file we provided for VQAv2.

@phanxuanphucnd

phanxuanphucnd commented Oct 21, 2022

Hi @yangapku

An illustrative example of the trainval_ans2label.pkl file is:

{ "": 0, "boats": 835, "not at all": 2421, "name": 1, "harley davidson": 78, "plain": 2, "20 ft": 2379, "museum": 3, "parking": 1710, "behind": 1590, "steeple": 4, "turning": 2380, "tent": 836, "no parking": 1995, "tulip": 1568, "low": 452, "muffin": 1172, "9:55": 846, "hair": 453 }
I don't understand what it means. By what mechanism are the indexes assigned?
Can you help me understand?

Thanks

@yangapku
Member

Hi, it's a python dict that maps each candidate answer text to its index (starting from 0). The indexes can be assigned arbitrarily, with no specific rules. Just make sure that each candidate answer is assigned a unique index and that the indexes run contiguously from 0. All ground-truth answers of the training and validation samples must be included in this candidate answer set.
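The two constraints above (unique indexes, contiguous from 0) can be checked with a small sanity-check sketch (this helper is hypothetical, not part of OFA):

```python
def check_ans2label(ans2label: dict) -> None:
    """Verify the constraints an ans2label dict must satisfy."""
    labels = sorted(ans2label.values())
    # Each candidate answer must carry a unique index...
    assert len(set(labels)) == len(labels), "duplicate label ids"
    # ...and the indexes must run contiguously starting from 0.
    assert labels == list(range(len(labels))), "label ids not contiguous from 0"

# A valid mapping passes silently; the order of assignment does not matter.
check_ans2label({"": 0, "name": 1, "plain": 2, "museum": 3})
```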

@phanxuanphucnd

phanxuanphucnd commented Oct 21, 2022

Yes, I understand it as follows:
All answers in the question-answer pairs of the train and valid datasets must be included in this file, right?
And the values must be unique?

Thanks @yangapku

@yangapku
Member

@phanxuanphucnd Yes. In our practice on the VQAv2 dataset, which has a long-tailed distribution over the ground-truth answers, we follow the common practice of building this dict from the 3,129 most frequent answers. We then filtered the original training and validation splits, keeping only the question-answer pairs whose answer is in this candidate set for finetuning OFA.
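The frequency-based selection and filtering described above can be sketched as follows (the samples and TOP_K value are toy placeholders; VQAv2 uses 3,129):

```python
from collections import Counter

# Hypothetical training samples as (question, answer) pairs;
# replace with your actual train/valid data.
samples = [
    ("what color is it?", "red"),
    ("what color is it?", "red"),
    ("is it daytime?", "yes"),
    ("is it daytime?", "yes"),
    ("how many dogs?", "2"),
]
TOP_K = 2  # VQAv2 practice uses the 3,129 most frequent answers

# Keep the TOP_K most frequent ground-truth answers as the candidate set...
freq = Counter(ans for _, ans in samples)
candidates = {ans for ans, _ in freq.most_common(TOP_K)}

# ...build the ans2label dict over that set (index assignment is arbitrary)...
ans2label = {ans: i for i, ans in enumerate(sorted(candidates))}

# ...and drop any sample whose answer falls outside the candidate set.
filtered = [(q, a) for q, a in samples if a in candidates]
```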
