"MED_INS_00001": {
"instruction":"XXX",
"answer":"XXX.",
"image_ids":["XXX",",..."], # The multi-images corresponding to this instruction
"rel_ins_ids":[], # This value can be []. If you have a multi-round conversation, it should be filled with the instruction ids of the other rounds.
},
I was delighted to stumble upon this remarkable project. Thank you for your valuable contribution.
I am now working on a medical image captioning task (multiple slices and one description per patient). Following the comment above, I built the training data as MED.json and MED_instruction. Here is what the instruction JSON looks like:
{
"meta": {
"version": "",
"time": "",
"author": ""
},
"data": {
"test_INS_00000": {
"instruction": "",
"answer": ".\n ",
"image_ids": [
"MED_IMG_1",
"MED_IMG_2",
"MED_IMG_3",
"MED_IMG_4",
"MED_IMG_5",
"MED_IMG_6",
"MED_IMG_7",
"MED_IMG_8",
"MED_IMG_9",
"MED_IMG_10",
"MED_IMG_11",
"MED_IMG_12",
"MED_IMG_13",
"MED_IMG_14",
"MED_IMG_15",
"MED_IMG_16",
"MED_IMG_17",
"MED_IMG_18",
"MED_IMG_19",
"MED_IMG_20",
"MED_IMG_21",
"MED_IMG_22",
"MED_IMG_23",
"MED_IMG_24"
],
"rel_ins_ids": []
},
.....
}
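For completeness, here is roughly how such an instruction file can be generated for many patients. This is only a sketch: the patient_slices mapping, the file name, and the empty instruction/answer strings are placeholders for illustration, not part of the actual pipeline.
import json

# Hypothetical mapping: instruction id -> the image ids of that patient's slices.
patient_slices = {
    "test_INS_00000": [f"MED_IMG_{i}" for i in range(1, 25)],
}

data = {}
for ins_id, image_ids in patient_slices.items():
    data[ins_id] = {
        "instruction": "",       # the captioning prompt for this patient
        "answer": "",            # the reference description
        "image_ids": image_ids,  # all slices belonging to this instruction
        "rel_ins_ids": [],       # empty unless part of a multi-round conversation
    }

with open("MED_instruction.json", "w") as f:
    json.dump({"meta": {"version": "", "time": "", "author": ""}, "data": data}, f, indent=2)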
The version of Otter I'm using is the 8/17 commit. I've successfully obtained the generated captions and evaluated them with BLEU and CIDEr. However, I noticed by accident that VQA mode performs on par with SD mode, and that different instructions lead to noticeably different performance. Does that mean SD mode doesn't suit my training scenario, and that VQA mode can help me test my instructions?
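For reference, BLEU and CIDEr can be computed along these lines (a minimal sketch assuming the pycocoevalcap package; not necessarily the exact evaluation script used here):
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# Both dicts map an instruction id to a list of caption strings.
refs = {"test_INS_00000": ["reference description for this patient"]}
hyps = {"test_INS_00000": ["caption generated by Otter"]}

bleu, _ = Bleu(4).compute_score(refs, hyps)    # returns [BLEU-1, ..., BLEU-4]
cider, _ = Cider().compute_score(refs, hyps)
print("BLEU-4:", bleu[3], "CIDEr:", cider)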
Furthermore, I'm trying to use the BiomedCLIP image encoder, as the LLaVA-Med paper did. However, the 0817 instruction_following.py has no customized_config argument, and adding the customized_config handling from the 0830 commit to it does nothing: the resulting checkpoint config still lists CLIP.
Here's the config.json I created, as suggested by the 0830 commit.
{
"model_type": "otter",
"cross_attn_every_n_layers": 4,
"tie_word_embeddings": false,
"use_media_placement_augmentation": true,
"only_attend_previous": true,
"text_config": {
"_name_or_path": "luodian/llama-7b-hf",
"model_type": "llama"
},
"vision_config": {
"_name_or_path": "microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224",
"model_type": "clip_vision_model",
"hidden_size": 768,
"intermediate_size": 3072,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"image_size": 224,
"patch_size": 16
}
}
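As a sanity check, the vision_config block above describes a standard ViT-B/16 tower at 224x224, which matches the BiomedCLIP checkpoint name. Whether Otter's training script actually consumes it is the open question here; the sketch below only verifies the block itself, assuming it is read as a transformers CLIPVisionConfig.
import json
from transformers import CLIPVisionConfig

with open("config.json") as f:
    cfg = json.load(f)

# Rebuild the vision tower config exactly as written in the file above.
vision_cfg = CLIPVisionConfig.from_dict(cfg["vision_config"])
print(vision_cfg.hidden_size, vision_cfg.num_hidden_layers, vision_cfg.patch_size)
# Expected: 768 12 16, i.e. a ViT-B/16 tower with image_size 224.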
Looking forward to exploring this topic and to citing you and your colleagues in any resulting publication!
Does that mean the SD mode doesn't suit my training scenario, and VQA mode can help me test my instructions?
In your case, where one instruction pairs with multiple images, we recommend using SD mode. Although SD mode and VQA mode may achieve similar performance in your case, SD mode is the logically appropriate choice for your data-construction scenario.
Thank you so much for your reply. We'll continue with our SD experiments.
Regarding the vision encoder, do you have any suggestions for replacing it?
Accomplishing a multi-image response with a single instruction can be easily done by adhering to the dataset format found here:
Otter/pipeline/mimicit_utils/mimicit_dataset.py, line 432 (commit 9b34a44)
To achieve this, you may follow these steps:
to:
This is because your instruction uses the same data format (multi-image, one conversation) as the "Spot-the-difference" data.
to:
If you have any further inquiries, don't hesitate to reach out via email. We can also add you to our Slack community for more immediate communication.
Originally posted by @ZhangYuanhan-AI in #234 (comment)