For the Dense Connector, we evaluate our models across 19 diverse benchmarks, including 11 image benchmarks and 8 video benchmarks. You can download our model weights from the Model Zoo to reproduce the tests and view our evaluation results.
Our evaluation is divided into two parts: image and video assessments. The image evaluation scripts are based on LLaVA. For video evaluation, we adopt the FreeVA method to extend our model's capabilities to video understanding. Our model exhibits strong temporal understanding without being exposed to any video data during training.
Our image evaluation follows the LLaVA guidelines. For a more detailed description of the evaluation process, please refer to the link provided here.
We also provide an evaluation example: if you wish to assess the performance of our model on GQA, you can run the following command:
sh scripts/v1_5/eval/gqa.sh
Please note that if you want to evaluate the model on the MMMU benchmark, you should first unzip the dc/eval/MMMU.zip file.
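For instance, assuming you run the command from the repository root, the archive can be extracted in place as follows (the -d target directory is an assumption; adjust it to wherever your evaluation scripts expect the MMMU data):
unzip dc/eval/MMMU.zip -d dc/eval/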
Our video evaluation process is divided into two steps: the first step generates video prediction results using the scripts from here, and the second step evaluates those predictions with GPT-3.5 using the scripts from here.
For example, if you want to evaluate the model on MSVD-QA benchmark, you should follow these steps:
Run the following command to generate video predictions:
sh scripts/v1_5/eval/video/run_qa_msvd.sh
We use the --use_pool option to reduce the number of tokens, allowing the Dense Connector to process more frames.
Moreover, the upper limit on the number of frames T is determined by the max_position_embeddings of the large language model. For example, when using ViT-L/336px with pooling (where each frame is downsampled by a factor of two), each frame results in 288 tokens. Therefore, the setting of T should satisfy T * 288 < max_position_embeddings.
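As a quick sanity check, the sketch below computes the largest T satisfying this constraint. The context length of 4096 is an assumption (e.g., a Vicuna-v1.5 backbone), and text prompt tokens also consume context, so treat the result as an upper bound rather than a recommended setting.

```bash
# Hypothetical sketch: estimate the upper bound on the number of frames T.
# 4096 is an assumed max_position_embeddings (e.g., Vicuna-v1.5); prompt text
# tokens also occupy context, so the usable T is smaller in practice.
max_position_embeddings=4096
tokens_per_frame=288   # ViT-L/336px with --use_pool (2x downsampling), per the text above
max_frames=$(( (max_position_embeddings - 1) / tokens_per_frame ))
echo "T * ${tokens_per_frame} < ${max_position_embeddings}  =>  T <= ${max_frames}"
```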
After generating the video predictions, we evaluate them with GPT-3.5. The command is as follows:
sh scripts/v1_5/eval/video/gpt_eval/eval_qa_msvd.sh