We currently support evaluation for three main kinds of tasks:
- VQA (Comprehension)
- NLP (Comprehension)
- Text2Image (Creation)
We support evaluation on several datasets at omni/eval/vqa.
All formatted annotation files (omni_comprehension_eval_format_files) and the formatted MM-Vet dataset (MM-VET) can be downloaded from Google Drive. Please download them and put them under data.
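For reference, here is a minimal sketch of the expected layout under data. The archive names are assumptions; adjust them to match your actual downloads. The MM-VET paths follow the evaluation command shown further below.
mkdir -p ./data
# Assumed archive names; the resulting layout should look like:
#   ./data/omni_comprehension_eval_format_files/   # formatted annotation files
#   ./data/MM-VET/MMGPT_mm-vet.json                # formatted MM-Vet annotations
#   ./data/MM-VET/images/                          # MM-Vet images
unzip omni_comprehension_eval_format_files.zip -d ./data
unzip MM-VET.zip -d ./data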
Dataset | Task | Metrics | Evaluation Server |
---|---|---|---|
VQAv2 | VQA | Accuracy | Server |
OKVQA | VQA | Accuracy | Local |
VizWiz | VQA | Accuracy | Server |
TextVQA | VQA | Accuracy | Local |
MM-Vet | VQA | Accuracy | Server |
MMBench | VQA | Accuracy | Server |
COCO Caption | Captioning | CIDEr | Local |
Image2Paragraph | Captioning | CIDEr | Local |
NoCaps | Captioning | CIDEr | Local |
DocVQA | VQA | ANLS | Server |
InfographicVQA | VQA | ANLS | Server |
POPE | Hallucination | Accuracy | Local |
Note: DocVQA and InfographicVQA require high-resolution inputs to get reasonable results, so a model trained on low-resolution images (e.g., 224x224) with CLIP as the vision encoder will score very low. Models like Vary that use high-resolution images and hybrid image representations perform much better on these tasks.
To evaluate VQA tasks such as MM-Vet, please run the following:
# MM-Vet (Submit to https://huggingface.co/spaces/whyu/MM-Vet_Evaluator)
python -m omni.eval.vqa.eval_dreamllm \
--model_name path2model \
--gtfile_path ./data/MM-VET/MMGPT_mm-vet.json \
--image_path ./data/MM-VET/images \
--out_path ./samples/mm-vet \
--num-chunks 1 \
--datatype mmvet \
--img_aug none \
--beamsearch True \
--evaltype "all" \
--prompt "Please provide an accurate and detailed answer." \
--system_prompt "This is an exam, please answer according to the image and question."
Then, submit the generated results_final.json file to the evaluation server to get the results.
We also provide a script, scripts/eval/vqa/eval_vqa.sh, for testing different benchmarks.
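The script's arguments are defined in the file itself; a hypothetical invocation, assuming the model path is passed as the first positional argument, might look like:
# Hypothetical usage; check scripts/eval/vqa/eval_vqa.sh for the actual arguments.
sh scripts/eval/vqa/eval_vqa.sh path2model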
We support NLP evaluation on multi-task language understanding and other QA datasets at omni/eval/language_eval.
All formatted annotation files can be downloaded from Google Drive, and the MMLU dataset can also be downloaded from Google Drive.
Dataset | Task | Evaluation Server |
---|---|---|
BoolQ | QA | Local |
PIQA | QA | Local |
SIQA | QA | Local |
HellaSwag | QA | Server |
WinoGrande | QA | Local |
MMLU | Multi-task | Local |
We have integrated a comprehensive evaluation toolkit called llama_evlaution_main. This toolkit supports evaluation on various datasets via Hugging Face, but the dataset splits may differ from the official ones typically used in papers. For official comparison, you can run the evaluation scripts at omni/eval/language_eval/submission_scripts. For example, to evaluate BoolQ accuracy, run:
python omni/eval/language_eval/submission_scripts/submission_dev_boolq.py \
--model_dir path2model
We support text-to-image evaluation on COCO and LN-COCO at omni/eval/text2img.
- You first have to prepare the MS COCO images or the FID statistics files. The caption annotation files are captions_train2014.json and captions_val2014.json for MS COCO, and lncoco_captions_val2017.jsonl for LN COCO. To calculate FID, you also have to prepare the fid_stats.npz file, which is fid_stats_mscoco256_val.npz for MS COCO and fid_stats_lncoco256_val5k.npz for LN COCO. We have uploaded all of these files in coco_fid_files on Google Drive.
- If you have your own dataset, you can create the fid_stats file by running:
python ./third_party/pytorch-fid/src/pytorch_fid/fid_score.py \
--path "path2images" "path2fid_stats.npz" \
--resolution 256 \
--batch-size 50 \
--save-stats
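After generating images, the same script can be used to compute FID against a precomputed statistics file. This is a minimal sketch, assuming the fork accepts an .npz statistics file in place of an image directory (as upstream pytorch-fid does); path2generated_images is a placeholder for your samples.
# Assumption: one of the two --path arguments may be an .npz statistics file, as in upstream pytorch-fid.
python ./third_party/pytorch-fid/src/pytorch_fid/fid_score.py \
--path "path2generated_images" "path2fid_stats.npz" \
--resolution 256 \
--batch-size 50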
We currently support the MS COCO and LN COCO datasets.
Dataset | Task | Metrics | Evaluation Server |
---|---|---|---|
MS COCO | Text2Image | FID | Local |
LN COCO | Text2Image | FID | Local |
We provide a script to run COCO FID evaluation with CLIP-based best-of-8 sample selection. Just run:
OUTPUT_DIR="YOUR_OUTPUT_DIR"
MODEL_NAME_OR_PATH="YOUR_MODEL_PATH"
sh scripts/eval/text2img/eval_coco_zero_shot_clip8_select.sh $OUTPUT_DIR $MODEL_NAME_OR_PATH