Official repository of ECO from team DSBA Lab for the NICE 2024 Challenge
- Kiyoon Jeong
- Woojun Lee
- Woongchan Nam
- Minjeong Ma
This guide provides detailed instructions for setting up the environments needed to run the code for the different models: blip2, EvaCLIP, MobileCLIP, MetaCLIP, and OpenCLIP. blip2 and EvaCLIP each use their own environment, while MobileCLIP, MetaCLIP, and OpenCLIP share a single one, for optimal performance and compatibility.
You can install the required dependencies by running the following commands:
# blip2
conda create -n blip2 python=3.8
conda activate blip2
pip install salesforce-lavis omegaconf
# EVA-CLIP
conda create -n evaclip python=3.10
conda activate evaclip
git clone https://github.com/baaivision/EVA.git
cd EVA/EVA-CLIP-18B
pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
pip install -r requirements.txt
pip install omegaconf
# CLIP Env (MobileCLIP, MetaCLIP, OpenCLIP)
conda create -n clipenv python=3.10
conda activate clipenv
# Install Mobile CLIP
git clone https://github.com/apple/ml-mobileclip.git
cd ml-mobileclip
pip install -e .
pip install omegaconf
Alternatively, you can build the environment from the Dockerfile by running the following commands:
docker build -t nice2024 .
docker run -it --gpus all nice2024
The original data can be downloaded from the competition website, or you can use the get_data.sh script to download it.
The scores can be downloaded from the following link: Google Drive
├── data
│   ├── original_data
│   │   ├── candidate_captions.csv
│   │   ├── images_20k
│   │   └── pred.csv
│   ├── results
│   │   └── # Result csv file will be saved here
│   └── scores
│       ├── blip2_itc_scores.json # You can generate by 1. Score Generation
│       ├── blip2_itm_scores.json # You can generate by 1. Score Generation
│       ├── evaclip_scores.json # You can generate by 1. Score Generation
│       ├── itm_filtered_consensus.json # You can generate by 2. Consensus Score Generation
│       ├── metaclip_scores.json # You can generate by 1. Score Generation
│       ├── mobileclip_scores.json # You can generate by 1. Score Generation
│       └── openclip_scores.json # You can generate by 1. Score Generation
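Before running the later stages, it can help to confirm that the score files listed above are in place. The following is a minimal sketch and not part of the repository: the helper name `missing_score_files` is hypothetical, and the file names are copied from the tree above.

```python
from pathlib import Path

# File names copied from the directory tree in this README.
EXPECTED_SCORE_FILES = [
    "blip2_itc_scores.json",
    "blip2_itm_scores.json",
    "evaclip_scores.json",
    "itm_filtered_consensus.json",
    "metaclip_scores.json",
    "mobileclip_scores.json",
    "openclip_scores.json",
]


def missing_score_files(scores_dir="data/scores"):
    """Return the expected score files that are absent from scores_dir.

    Hypothetical helper for illustration; the repository may organize
    its own checks differently.
    """
    root = Path(scores_dir)
    return [name for name in EXPECTED_SCORE_FILES if not (root / name).is_file()]
```

Calling `missing_score_files()` from the repository root returns an empty list once every file is present. Note that itm_filtered_consensus.json is expected to be missing until the consensus score step has been run.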
The model weights can be downloaded from the following links. The weights of openclip and blip2 are downloaded automatically when you run the score generation script.
model_name | model weight |
---|---|
EvaCLIP | EVA_18B_psz14.fp16 (36.7GB) |
MetaCLIP | ViT-bigG-14-quickgelu (28.38GB) |
MobileCLIP | mobileclip_blt (571.46MB) |
Alternatively, you can download all the model weights by running the following command:
source ./scripts/00_get_model_weights.sh # Files will be downloaded to `model_weights` directory.
├── model_weights
│   ├── evaclip
│   │   └── EVA_CLIP_18B_psz14_s6B.fp16.pt
│   ├── metaclip
│   │   └── G14_fullcc2.5b.pt
│   └── mobileclip
│       └── mobileclip_blt.pt
You need to prepare score files for each vision-language model to generate the final submission file.
You can either generate the scores for each model or you can download the scores from the following links: Google Drive
You need 80GB of VRAM to run the EVA-CLIP 18B model.
source ./scripts/01_evaclip_score.sh #script_filename [evaclip_score.sh, metaclip_score.sh, mobileclip_score.sh, openclip_score.sh, blip2_itc_score.sh, blip2_itm_score.sh]
Or you can generate all the scores by running the following command:
source ./scripts/01_all_score.sh
After generating the scores for each vision-language model, you can generate the consensus scores by running the following command:
source ./scripts/02_consensus_score.sh
Finally, fuse all the scores and generate the final submission file by running the following command:
source ./scripts/03_ensemble_score.sh
Please be aware that when running inference on different GPUs, minor discrepancies in the results may occur, particularly at or beyond the fourth decimal place. These variations stem from differences in floating-point arithmetic precision and hardware architecture. For consistent results, please refer to the provided score JSON files.
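To see how far a local run drifts from a reference run, the two results can be compared score by score. This is a rough sketch under stated assumptions: both inputs are taken to follow the pred.json layout described in this README (image filename mapped to parallel `captions` and `scores` lists), and the helper name `max_score_gap` is hypothetical rather than something the repository provides.

```python
def max_score_gap(run_a, run_b):
    """Largest absolute score difference over images present in both runs.

    Assumes each run maps an image filename to a dict with a "scores" list,
    and that both runs list the captions in the same order.
    """
    gap = 0.0
    for image, entry_a in run_a.items():
        entry_b = run_b.get(image)
        if entry_b is None:
            continue  # image absent from the second run; skipped in this sketch
        for score_a, score_b in zip(entry_a["scores"], entry_b["scores"]):
            gap = max(gap, abs(score_a - score_b))
    return gap
```

Loading two result files with `json.load` and passing them to `max_score_gap` gives a single number that can be checked against the tolerance you are willing to accept.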
The final submission file will be saved in the data/results directory.
The submission file will be named pred.csv and will be in the following format:
id | filename | caption |
---|---|---|
0 | 1586682407.jpg | a man and a woman in lab coats looking at a watch. |
1 | 1866091313.jpg | a man standing in a field of wheat. |
2 | 1722076415.jpg | a group of people riding a ski lift on a sunny day. |
... | ... | ... |
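The submission file can be inspected with Python's standard `csv` module. The sketch below assumes the file is comma-separated with a header row naming the `id`, `filename`, and `caption` columns shown above; this README only shows the rendered table, so the delimiter and header are assumptions. The function name `read_submission` is hypothetical.

```python
import csv


def read_submission(path="data/results/pred.csv"):
    """Map each image filename to its submitted caption.

    Assumes a comma-separated file whose header row contains
    "filename" and "caption" columns.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return {row["filename"]: row["caption"] for row in csv.DictReader(handle)}
```

The returned dictionary makes it easy to look up the caption chosen for a given image, for example `read_submission()["1586682407.jpg"]`.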
You can also find the final scores of each caption in the data/results directory.
The final scores file will be named pred.json and will be in the following format:
{"1839580541.jpg":
{"captions": ["a woman is holding a small blue brush and sitting on a bed.",
"a woman sitting on a bed with a blue paintbrush.",
"a woman holding a roller brush sitting on top of a bed.",
...
],
"scores": [3.537973796768244,
5.110968624476615,
3.5150481446519826,
...
]
},
...
}
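Given this layout, the top-scoring caption for each image can be recovered with a few lines of Python. This is an illustrative sketch only: it assumes that a higher score means a better caption, which this README does not state explicitly, and the function name `best_captions` is hypothetical.

```python
def best_captions(pred):
    """Pick the caption with the highest score for each image.

    `pred` follows the pred.json layout: image filename mapped to a dict
    with parallel "captions" and "scores" lists. Assumes higher is better.
    """
    best = {}
    for image, entry in pred.items():
        captions, scores = entry["captions"], entry["scores"]
        # Index of the largest score; ties resolve to the earliest caption.
        top = max(range(len(scores)), key=scores.__getitem__)
        best[image] = captions[top]
    return best
```

After loading data/results/pred.json with `json.load`, passing the result to `best_captions` yields one caption per image filename.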
The content of this project itself is licensed under LICENSE.
Our codebase is built on multiple open-source contributions. We would like to thank the authors of the following repositories for their valuable contributions.