diff --git a/applications/DeepSpeed-VisualChat/README.md b/applications/DeepSpeed-VisualChat/README.md
new file mode 100755
index 000000000..01fbde04f
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/README.md
@@ -0,0 +1,116 @@
+## DeepSpeed-VisualChat: Enabling Multi-Round Multi-Image Chat for All Scales
+
+An easy-to-use, scalable, and efficient multi-modal training pipeline for a multi-round, multi-image interleaved chat experience.
+
+
+## Table of Contents
+
+- [📰 Latest News 📰](#-latest-news-)
+- [🚀 What is DeepSpeed-VisualChat 🚀](#-what-is-deepspeed-visualchat-)
+- [⚓ Get Started, Tutorial, and Documentation ⚓](#-get-started-tutorial-and-documentation-)
+- [🌱 DeepSpeed-VisualChat's Roadmap 🌱](#-deepspeed-visualchats-roadmap-)
+- [💬 DeepSpeed-VisualChat and DeepSpeed Community 💬](#-deepspeed-visualchat-and-deepspeed-community-)
+- [🙏 Acknowledgement and Citation 🙏](#-acknowledgement-and-citation-)
+
+
+
+## 📰 Latest News 📰
+
+* ***[2023/10] [DeepSpeed-VisualChat: Improve Your Chat Experience with Multi-Round Multi-Image Inputs](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-visualchat/10-03-2023/README.md)***
+
+⭐ If you find our [DeepSpeed](https://github.com/microsoft/DeepSpeed) and [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples) repositories beneficial, please give them a star on GitHub! To cite DeepSpeed-VisualChat, please cite our [arXiv report](https://arxiv.org/abs/2309.14327):
+
+```
+@article{yao2023deepspeed-visualchat,
+ title={{DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention}},
+ author={Zhewei Yao and Xiaoxia Wu and Conglong Li and Minjia Zhang and Heyang Qin and Olatunji Ruwase and Ammar Ahmad Awan and Samyam Rajbhandari and Yuxiong He},
+ journal={arXiv preprint arXiv:2309.14327},
+ year={2023}
+}
+```
+
+## 🚀 What is DeepSpeed-VisualChat 🚀
+
+
+
+Figure 1. On the left is a DeepSpeed-VisualChat model, featuring an innovative attention design. On the right is an example of DeepSpeed-VisualChat.
+
+
+
+---
+
+With increasing interest in enabling the multi-modal capabilities of large language models, DeepSpeed is proud to announce a new training pipeline, ***DeepSpeed-VisualChat***, designed to enable a multi-round, multi-image interleaved chat framework. It enhances the language model with image understanding and reasoning capabilities. Unlike the majority of open-source multi-modal projects, the primary focus of DeepSpeed-VisualChat is to provide a multi-round, multi-image interleaved chat experience, as illustrated in Figure 1.
+
+To improve model quality without introducing new parameters, DeepSpeed-VisualChat incorporates a new multi-modal causal attention mechanism, which better aligns visual and text features. Additionally, to overcome the scarcity of interleaved text-and-image inputs in most available open-source datasets, we employ various data blending techniques on existing datasets.
+
+Thanks to the scalable, efficient, and user-friendly nature of the DeepSpeed ecosystem, we are able to train with a 2B visual encoder from QWen-VL (which is additionally refined from OpenClip) and a 70B language decoder from LLaMA-2, showcasing the extraordinary scalability of the DeepSpeed-VisualChat framework.
+
+
+
+
+
+## ⚓ Get Started, Tutorial, and Documentation ⚓
+
+### 🐼 Installation
+
+
+```bash
+git clone https://github.com/microsoft/DeepSpeedExamples.git
+cd DeepSpeedExamples/applications/DeepSpeed-VisualChat/
+pip install -r requirements.txt
+```
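+
+As a quick sanity check (not part of the official instructions), the core dependencies should import cleanly after installation:
+
+```bash
+python -c "import torch, deepspeed, transformers; print(torch.__version__, deepspeed.__version__, transformers.__version__)"
+```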
+
+### 🐼 Datasets Preparation
+
+The table below summarizes where to download the datasets that we support. `{data_path}` denotes the `--data_path` argument provided in the training scripts. A sketch of the expected directory layout follows the table.
+
+| Dataset name | Where to download |
+|--------------|-------------------|
+| aokvqa | Download `2017 Train images [118K/18GB]` from [https://cocodataset.org/#download](https://cocodataset.org/#download) and save at `{data_path}/coco/train2017/`. Download `aokvqa_v1p0_train.json` from [https://allenai.org/project/a-okvqa/home](https://allenai.org/project/a-okvqa/home) and save at `{data_path}/aokvqa/annotations/`. |
+| coco_caption | Download 2014 Train images and 2014 Val images from [https://cocodataset.org/#download](https://cocodataset.org/#download) and save all images at `{data_path}/coco/2014/`. Download `dataset.json` from [https://cs.stanford.edu/people/karpathy/deepimagesent/coco.zip](https://cs.stanford.edu/people/karpathy/deepimagesent/coco.zip) and save at `{data_path}/coco_caption/`. |
+| llava | Download `2017 Train images [118K/18GB]` from [https://cocodataset.org/#download](https://cocodataset.org/#download) and save at `{data_path}/coco/train2017/`. Download `detail_23k.json` and `complex_reasoning_77k.json` from [https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) and save at `{data_path}/llava/`. |
+| llava_dial | Download `2017 Train images [118K/18GB]` from [https://cocodataset.org/#download](https://cocodataset.org/#download) and save at `{data_path}/coco/train2017/`. Download `conversation_58k.json` from [https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) and save at `{data_path}/llava/`. |
+| llava_otter_blend | Follow the instructions for the llava, llava_dial, and otter_mimicit_cgd datasets above. |
+| minigpt4 | Download `image` folder and `filter_cap.json` from [https://huggingface.co/datasets/Vision-CAIR/cc_sbu_align](https://huggingface.co/datasets/Vision-CAIR/cc_sbu_align) and save at `{data_path}/cc_sbu_align/`. |
+| ocr_vqa | Download `images` folder and `dataset.json` from [https://ocr-vqa.github.io/](https://ocr-vqa.github.io/) and save at `{data_path}/OCR_VQA/`. |
+| otter_mimicit_cgd | Download `2017 Train images [118K/18GB]` from [https://cocodataset.org/#download](https://cocodataset.org/#download) and save at `{data_path}/coco/train2017/`. Download `CGD_instructions.json` from [https://huggingface.co/datasets/pufanyi/MIMICIT](https://huggingface.co/datasets/pufanyi/MIMICIT) and save at `{data_path}/MIMIC-IT/`. |
+| otter_mimicit_sd | Download `SD.json` and `SD_instructions.json` from [https://huggingface.co/datasets/pufanyi/MIMICIT](https://huggingface.co/datasets/pufanyi/MIMICIT) and save at `{data_path}/MIMIC-IT/`. |
+| otter_mimicit_sn | Download `SN.json` and `SN_instructions.json` from [https://huggingface.co/datasets/pufanyi/MIMICIT](https://huggingface.co/datasets/pufanyi/MIMICIT) and save at `{data_path}/MIMIC-IT/`. |
+| otter_mimicit_tvc | Download `TVC.json` and `TVC_instructions.json` from [https://huggingface.co/datasets/pufanyi/MIMICIT](https://huggingface.co/datasets/pufanyi/MIMICIT) and save at `{data_path}/MIMIC-IT/`. |
+| otter_mimicit_vst | Download `VST.json` and `VST_instructions.json` from [https://huggingface.co/datasets/pufanyi/MIMICIT](https://huggingface.co/datasets/pufanyi/MIMICIT) and save at `{data_path}/MIMIC-IT/`. |
+| sparkles_dialogue | Download the `SparklesDialogueCC` and `SparklesDialogueVG` folders from the OneDrive link from [https://github.com/HYPJUDY/Sparkles](https://github.com/HYPJUDY/Sparkles) and save at `{data_path}/`. |
+
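+If helpful, the directory skeleton implied by the table can be pre-created as sketched below; `DATA_PATH` is a placeholder for the location you will later pass as `--data_path`.
+
+```bash
+DATA_PATH=/path/to/data   # placeholder: set to your own data root
+mkdir -p ${DATA_PATH}/coco/train2017 \
+         ${DATA_PATH}/coco/2014 \
+         ${DATA_PATH}/aokvqa/annotations \
+         ${DATA_PATH}/coco_caption \
+         ${DATA_PATH}/llava \
+         ${DATA_PATH}/cc_sbu_align \
+         ${DATA_PATH}/OCR_VQA \
+         ${DATA_PATH}/MIMIC-IT \
+         ${DATA_PATH}/SparklesDialogueCC \
+         ${DATA_PATH}/SparklesDialogueVG
+```
+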
+### 🐼 Training, Evaluation, Chat API, and Helper
+Please refer to
+ - [**Training**](./training/README.md)
+ - [**Evaluation**](./eval/README.md)
+ - [**Chat**](./chat/README.md)
+ - [**Helper**](./helper/README.md)
+
+
+## 🌱 DeepSpeed-VisualChat's Roadmap 🌱
+
+Our future plan includes, but is not limited to:
+- [ ] Support more models
+- [ ] Demonstrate how to train larger models with higher model quality
+
+## 💬 DeepSpeed-VisualChat and DeepSpeed Community 💬
+
+Just like how the success of [the BLOOM model](https://huggingface.co/bigscience/bloom) was supported by both the [DeepSpeed Team](https://github.com/bigscience-workshop/Megatron-DeepSpeed) and many [open source contributors](https://huggingface.co/bigscience), we welcome all AI developers, practitioners, and researchers to join this ongoing effort for DeepSpeed-VisualChat. To participate:
+- Show your support by leaving a star ⭐ on our [DeepSpeed](https://github.com/microsoft/DeepSpeed) and [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples) GitHub repositories.
+- Follow us on [Twitter](https://twitter.com/MSFTDeepSpeed) to get notified about our latest news. For Chinese users, you can also follow our [Chinese Zhihu account](https://www.zhihu.com/people/deepspeed). For Japanese users, you can also follow our [Japanese Twitter account](https://twitter.com/MSFTDeepSpeedJP).
+- Currently, we prefer to interact with open-source users mainly on GitHub so that it's easier for all users to search for related information. For bug reports, please submit a GitHub issue. For contributions, please submit a pull request (PR). For general questions and discussions, please open a new discussion or join any existing discussion.
+- We are open to collaborations with universities, research labs, and companies, such as working together on deep learning research, applying DeepSpeed to empower real-world AI models and applications, and so on. For such requests (and other requests unsuitable for GitHub), please directly email deepspeed-info@microsoft.com.
+
+
+## 🙏 Acknowledgement and Citation 🙏
+
+We thank the following papers and open-source repositories:
+
+ [1] LLaVA, https://github.com/haotian-liu/LLaVA
+ [2] Otter, https://github.com/Luodian/Otter
+ [3] Hugging Face Transformers, https://github.com/huggingface/transformers
+ [4] MiniGPT4, https://github.com/Vision-CAIR/MiniGPT-4
+ [5] QWen-VL, https://github.com/QwenLM/Qwen-VL
+ [6] Sparkles, https://github.com/HYPJUDY/Sparkles
+ [7] Multimodal-GPT, https://github.com/open-mmlab/Multimodal-GPT
diff --git a/applications/DeepSpeed-VisualChat/assets/banner.png b/applications/DeepSpeed-VisualChat/assets/banner.png
new file mode 100644
index 000000000..2684cafc2
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/assets/banner.png differ
diff --git a/applications/DeepSpeed-VisualChat/assets/ceos.png b/applications/DeepSpeed-VisualChat/assets/ceos.png
new file mode 100644
index 000000000..e148f545a
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/assets/ceos.png differ
diff --git a/applications/DeepSpeed-VisualChat/assets/friends.png b/applications/DeepSpeed-VisualChat/assets/friends.png
new file mode 100644
index 000000000..2689d8d4b
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/assets/friends.png differ
diff --git a/applications/DeepSpeed-VisualChat/assets/hero-figure.png b/applications/DeepSpeed-VisualChat/assets/hero-figure.png
new file mode 100644
index 000000000..ca79b2c62
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/assets/hero-figure.png differ
diff --git a/applications/DeepSpeed-VisualChat/assets/model.png b/applications/DeepSpeed-VisualChat/assets/model.png
new file mode 100644
index 000000000..dbd1f05c4
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/assets/model.png differ
diff --git a/applications/DeepSpeed-VisualChat/chat/README.md b/applications/DeepSpeed-VisualChat/chat/README.md
new file mode 100644
index 000000000..155061f1a
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/chat/README.md
@@ -0,0 +1 @@
+We provide a CLI interface for users to test their trained chat model. Please note that you need to provide both the trained checkpoint path and the paths of the original language model and vision encoder: the model is first initialized from those base models, and the trained checkpoint is then loaded on top. Also, if you used multi-modal causal attention during training, remember to pass `--enable_mmca_attention` in your chat script. A minimal invocation is sketched below.
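+
+The sketch below assumes it is run from this `chat/` directory (mirroring `chat_scripts/run.sh`); all model and checkpoint paths are placeholders that you should replace with your own.
+
+```bash
+# Placeholder paths: substitute your own base LLM, vision encoder, and trained checkpoint.
+python chat.py \
+    --lm_model_name_or_path /path/to/Llama-2-13b-hf \
+    --vision_model_name_or_path /path/to/qwen-clip \
+    --checkpoint_path /path/to/your/checkpoint \
+    --enable_mmca_attention
+```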
diff --git a/applications/DeepSpeed-VisualChat/chat/chat.py b/applications/DeepSpeed-VisualChat/chat/chat.py
new file mode 100644
index 000000000..57634e957
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/chat/chat.py
@@ -0,0 +1,169 @@
+#!/usr/bin/env python
+# Copyright (c) Microsoft Corporation.
+# SPDX-License-Identifier: Apache-2.0
+
+# DeepSpeed Team
+import argparse
+import os
+import sys
+from PIL import Image
+
+import torch
+import deepspeed
+
+sys.path.append(
+ os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir)))
+from utils.utils import print_rank_0
+from utils.model import create_dsvl_model_and_transforms
+import utils.data.DST as DST
+from transformers import AutoTokenizer
+from termcolor import colored
+import re
+
+def parse_args():
+ parser = argparse.ArgumentParser(
+ description=
+ "CLI chat")
+ parser.add_argument(
+ "--lm_model_name_or_path",
+ type=str,
+ help=
+ "Path to pretrained model or model identifier from huggingface.co/models.",
+ required=True,
+ )
+ parser.add_argument("--vision_model_name_or_path", default="openai/clip-vit-large-patch14", type=str)
+ parser.add_argument(
+ "--checkpoint_path",
+ default=None,
+ type=str,
+        help="path to the trained checkpoint directory",
+ )
+ parser.add_argument(
+ "--max_seq_len",
+ type=int,
+ default=4096,
+ help="The maximum sequence length.",
+ )
+ parser.add_argument(
+ "--generation_length_per_round",
+ type=int,
+ default=256,
+ help="The generation length per conversation round.",
+ )
+ parser.add_argument(
+ "--enable_mmca_attention",
+ action='store_true',
+ help="enable the new proposed attn, which is similar to cross attention",
+ )
+ parser.add_argument(
+ "--vis_proj",
+ type=str,
+ default='baseline',
+ help="baseline, vit, or perceiver",
+ )
+ parser = deepspeed.add_config_arguments(parser)
+ args = parser.parse_args()
+
+ return args
+
+
+def get_user_text_input():
+    tmp = input(colored("Enter input (type 'quit' to exit, 'clear' to clear the conversation history): ", 'green'))
+ return tmp, tmp == "quit", tmp == "clear"
+
+def get_user_image_input():
+    tmp = input(colored("Enter image paths, separated by spaces (only one image per turn is currently supported) (type 'na' for no image): ", 'blue'))
+ return tmp, not tmp == "na"
+
+def main():
+ args = parse_args()
+ tokenizer = AutoTokenizer.from_pretrained(args.lm_model_name_or_path,
+ fast_tokenizer=True)
+ tokenizer.padding_side = 'right'
+ model, image_processor, tokenizer = create_dsvl_model_and_transforms(
+ text_tokenizer = tokenizer,
+ ds_config=None,
+ args=args,
+ )
+
+ model.load_state_dict(torch.load(os.path.join(args.checkpoint_path, 'pytorch_model.bin'), map_location='cpu'), strict=False) # Z3 wouldn't save pos embeddings (vis and rope)
+
+ model = model.eval()
+ model.projection = model.projection.to('cuda')
+ model.vis_encoder = model.vis_encoder.to('cuda')
+ model = model.half()
+ print_rank_0(model)
+
+ num_rounds = 0
+ images = []
+ system_instruct = []
+ TEMPLATE = DST.Prompter() # get template
+ image_num_token_list = [DST.IMAGE_NUM_1, DST.IMAGE_NUM_2, DST.IMAGE_NUM_3, DST.IMAGE_NUM_4, DST.IMAGE_NUM_5, DST.IMAGE_NUM_6, DST.IMAGE_NUM_7, DST.IMAGE_NUM_8]
+
+ while True:
+ num_rounds += 1
+ while True:
+            # user-provided image input is easy to get wrong, so validate it carefully
+ image_input, with_image = get_user_image_input()
+ if with_image:
+ try:
+                    # image paths are separated by spaces
+ image_paths = image_input.split(' ')
+ tmp_images = []
+ for image_path in image_paths:
+ image = Image.open(image_path).convert('RGB')
+ tmp_image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0].unsqueeze(0).cuda().half()
+                        tmp_images.append(tmp_image_tensor) # collect into a temporary list so a bad path discards the whole batch
+                except Exception:
+ print(colored("Invalid image path, please try again", 'red'))
+ continue
+ if len(images) + len(tmp_images) > 8:
+ print(colored("Too many images, we at most support 8 images. please try again", 'red'))
+ continue
+ images = images + tmp_images # get all images
+ image_num = len(tmp_images)
+ break
+ else:
+ image_num = 0
+ break
+ assert len(images) >= 1, "We need at least one image to begin the conversation for now."
+ if len(images) > 0:
+ image_tensor = torch.cat(images, dim=0) # cat all images
+ else:
+ image_tensor = None
+
+ text_input, quit, clear = get_user_text_input()
+ if quit:
+ break
+ if clear:
+ num_rounds = 0
+ images = []
+ system_instruct = []
+ image_num_token_list = [DST.IMAGE_NUM_1, DST.IMAGE_NUM_2, DST.IMAGE_NUM_3, DST.IMAGE_NUM_4, DST.IMAGE_NUM_5, DST.IMAGE_NUM_6, DST.IMAGE_NUM_7, DST.IMAGE_NUM_8]
+ continue
+
+
+ full_prompt = TEMPLATE(text_input, with_image=with_image, first_message=(num_rounds==1), num_images=image_num)
+ if with_image:
+ for i in range(image_num):
+ full_prompt = re.sub(DST.DEFAULT_HUMAN_IMAGE_PRETOKEN, image_num_token_list.pop(0), full_prompt, count=1)
+
+
+        full_prompt_ids = tokenizer(full_prompt).input_ids # tokenize the full prompt
+
+ input_ids = torch.as_tensor([system_instruct + full_prompt_ids]).cuda() # entire input as system instruction for simplicity
+ generate_output = model.generate(image_tensor, input_ids, generation_length=args.generation_length_per_round)
+ extend_ids = generate_output[0].cpu().tolist()[0]
+ while extend_ids[-1] == tokenizer.pad_token_id:
+ extend_ids.pop()
+ while extend_ids[0] == tokenizer.bos_token_id:
+ extend_ids.pop(0)
+ system_instruct = system_instruct + full_prompt_ids + extend_ids # entire input as system instruction for simplicity
+ system_instruct = system_instruct + [tokenizer.eos_token_id] # add eos token
+
+ print(f"=========== Round {num_rounds} ===========")
+ print(tokenizer.decode(system_instruct))
+
+
+if __name__ == "__main__":
+ main()
diff --git a/applications/DeepSpeed-VisualChat/chat/chat_scripts/run.sh b/applications/DeepSpeed-VisualChat/chat/chat_scripts/run.sh
new file mode 100644
index 000000000..8c193d520
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/chat/chat_scripts/run.sh
@@ -0,0 +1,18 @@
+#!/bin/bash
+# Copyright (c) Microsoft Corporation.
+# SPDX-License-Identifier: Apache-2.0
+
+# DeepSpeed Team
+MAIN_PATH=$1
+
+VISION_ENCODER=/blob/transformers_cache/qwen-clip
+LLM=/blob/transformers_cache/Llama-2-13b-hf
+
+export CUDA_VISIBLE_DEVICES=0 # single-GPU chat
+# export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 # multi-GPU for large models (when a single GPU is not enough)
+
+
+python chat.py \
+ --lm_model_name_or_path $LLM \
+ --vision_model_name_or_path $VISION_ENCODER \
+ --checkpoint_path $MAIN_PATH --enable_mmca_attention
diff --git a/applications/DeepSpeed-VisualChat/eval/README.md b/applications/DeepSpeed-VisualChat/eval/README.md
new file mode 100644
index 000000000..e39bbf035
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/eval/README.md
@@ -0,0 +1,28 @@
+### Evaluation
+We provide a few examples to test the quality of trained models.
+To run the tests, use the `batch_generation.py` script, which reads the evaluation prompts from the JSON files located in `eval_data/*.json`.
+You will need to specify the path where you saved your checkpoints. For example, if you saved a checkpoint at `$YOUR_CHECKPOINT_PATH/epoch-5/pytorch_model.bin`, pass the following arguments:
+```
+--checkpoint_path $YOUR_CHECKPOINT_PATH --checkpoint_names epoch-5
+```
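+
+For reference, a complete single run might look like the sketch below (all model paths and the checkpoint name are placeholders; `eval/eval_scripts/run_batch.sh` is the reference script that loops over all evaluation sets):
+
+```bash
+# Run from DeepSpeedExamples/applications/DeepSpeed-VisualChat; paths below are placeholders.
+python eval/batch_generation.py --model_name dsvl --vis_proj baseline --max_seq_len 4096 \
+    --lm_model_name_or_path /path/to/Llama-2-13b-hf \
+    --vision_model_name_or_path /path/to/qwen-clip \
+    --checkpoint_path $YOUR_CHECKPOINT_PATH --checkpoint_names epoch-5 \
+    --eval_data eval_single --enable_mmca_attention \
+    --output_filename eval/results/eval_single/my_run
+```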
+
+##### Run the Code
+NOTE: Before running `run_batch.sh`, please read it carefully. If you use the evaluation set `eval_comprehensive`, the script creates the folder `eval/results/eval_comprehensive` and writes a `{args.output_filename}.csv` file there with four columns; the generated output is in the last column. Please read one of our examples, such as `eval/results/eval_comprehensive/ours-set1_final.csv`.
+To run the code, go to the application's root folder:
+```
+cd DeepSpeedExamples/applications/DeepSpeed-VisualChat
+bash eval/eval_scripts/run_batch.sh $YOUR_CHECKPOINT_PATH
+```
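+
+Once a run completes, you can spot-check the generations directly from the CSV; the exact filename depends on the `--output_filename` argument (the path below assumes the defaults in `eval/eval_scripts/run_batch.sh`):
+
+```bash
+# Print the CSV header plus the first result row of a comprehensive-eval run.
+head -n 2 eval/results/eval_comprehensive/ours_final.csv
+```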
+
+
+#### Our Model Results Overview
+We present the outcomes of our three distinct models, all trained with the `qwen-clip` vision encoder and the `Llama-2-13b-hf` language model.
+
+###### Results Directories and Training Details:
+- **results/eval_single:**
+ This directory contains results from the model trained with LoRA, featuring a dimension size of 128.
+
+- **results/eval_comprehensive** and **results/eval_robustness:**
+ These directories host results from two models:
+ - One model is trained excluding the Sparkles dataset (referred to as `ours-set1`).
+  - The other incorporates the Sparkles dataset in training (denoted as `ours-set2`).
diff --git a/applications/DeepSpeed-VisualChat/eval/batch_generation.py b/applications/DeepSpeed-VisualChat/eval/batch_generation.py
new file mode 100644
index 000000000..c22c2f6a9
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/eval/batch_generation.py
@@ -0,0 +1,226 @@
+#!/usr/bin/env python
+# Copyright (c) Microsoft Corporation.
+# SPDX-License-Identifier: Apache-2.0
+
+# DeepSpeed Team
+import argparse
+import os
+import csv
+import sys
+from PIL import Image
+
+import torch
+import deepspeed
+sys.path.append(
+ os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir)))
+from utils.utils import print_rank_0, to_device
+from utils.model import create_dsvl_model_and_transforms as create_model_and_transforms
+import utils.data.DST as DST
+
+
+
+
+from typing import Iterable
+from transformers import AutoTokenizer, set_seed
+import json
+import collections
+import numpy as np
+import random
+def load_json(file_path):
+ with open(file_path, 'r') as json_file:
+ data = json.load(json_file)
+ return data
+
+# usage
+def parse_args():
+    parser = argparse.ArgumentParser(
+        description=
+        "Batch generation script for evaluating DeepSpeed-VisualChat models")
+ parser.add_argument('--question-format',
+ type=str,
+ default="prompt_qa",
+ choices=['prompt_qa', 'prompt_choice'],
+ help='question-format')
+    parser.add_argument('--question',
+                        type=str,
+                        default="please describe the image",
+                        help='default question to ask about the image')
+ parser.add_argument(
+ "--lm_model_name_or_path",
+ type=str,
+ help=
+ "Path to pretrained model or model identifier from huggingface.co/models.",
+ required=True,
+ )
+ parser.add_argument("--vision_model_name_or_path", default="openai/clip-vit-large-patch14", type=str)
+ parser.add_argument(
+ "--pretrained_path",
+ default=None,
+ type=str,
+ help="path to pretrained model",
+ )
+ parser.add_argument(
+ "--image_token_length",
+ type=int,
+ default=256,
+        help="The number of tokens used to represent each image.",
+ )
+ parser.add_argument(
+ "--max_seq_len",
+ type=int,
+ default=2048,
+ help="The maximum sequence length.",
+ )
+ parser.add_argument(
+ "--checkpoint_path",
+ default=None,
+ type=str,
+        help="path to the trained checkpoint directory",
+ )
+    parser.add_argument('--checkpoint_names',
+                        nargs='*',
+                        default=['runing_check_stage2_v3_epoch10',],
+                        help='Checkpoint folder name(s) under --checkpoint_path, '
+                        'e.g. "final best_eval" or "epoch-5"')
+ parser.add_argument(
+ "--model_name",
+ default="dsvl",
+ type=str,
+ choices=["dsvl", "toy"],
+        help="model implementation to use",
+ )
+ parser.add_argument(
+ "--enable_mmca_attention",
+ action='store_true',
+ help="enable the new proposed attn, which is similar to cross attention",
+ )
+ parser.add_argument(
+ "--vis_proj",
+ type=str,
+ default='baseline',
+ help="baseline, vit, or perceiver",
+ )
+ parser.add_argument(
+ "--eval_data",
+ default="dsvl",
+ type=str,
+        help="name of the evaluation set JSON under eval/eval_data (without the .json extension)",
+ )
+ parser.add_argument(
+ "--output_filename",
+ default="results",
+ type=str,
+        help="output filename (prefix) for the generated results CSV",
+ )
+ parser.add_argument(
+ "--seed",
+ type=int,
+ default=123,
+        help="Random seed for reproducible generation.",
+ )
+ parser = deepspeed.add_config_arguments(parser)
+ args = parser.parse_args()
+
+ return args
+
+def main():
+ args = parse_args()
+ with open(f'./eval/eval_data/{args.eval_data}.json', 'r') as file:
+ data = json.load(file)
+ if args.seed is not None:
+ set_seed(args.seed)
+ random.seed(args.seed)
+ np.random.seed(args.seed)
+ torch.manual_seed(args.seed)
+ torch.cuda.manual_seed_all(args.seed)
+
+ tokenizer = AutoTokenizer.from_pretrained(args.lm_model_name_or_path,
+ fast_tokenizer=True)
+ tokenizer.padding_side = 'right'
+ model, image_processor, tokenizer = create_model_and_transforms(
+ text_tokenizer = tokenizer,
+ ds_config=None,
+ args=args,
+ )
+ get_results = collections.defaultdict(list)
+ for ck_name in args.checkpoint_names:
+ ck_path = os.path.join(args.checkpoint_path, ck_name)
+ print (ck_path)
+ if ck_path is not None:
+ model.load_state_dict(torch.load(os.path.join(ck_path, 'pytorch_model.bin'), map_location='cpu'), strict=False) # Z3 wouldn't save pos embeddings (vis and rope)
+ else:
+            print_rank_0("Warning: no checkpoint loaded, so the generated results will not be meaningful")
+ #model = model.cuda().half()
+ model = model.eval()
+ model.projection = model.projection.to('cuda')
+ model.vis_encoder = model.vis_encoder.to('cuda')
+ model = model.half()
+ print_rank_0(model)
+ for name in data.keys():
+ question_image_list = data[name]
+ print (f'{args.eval_data}-------------------------------------{name}')
+ images = []
+ system_instruct = []
+ TEMPLATE = DST.Prompter() # get template
+ image_token_dict = DST.get_image_num_map(tokenizer)
+ image_num = 0
+ for round, q_i_pair in enumerate(question_image_list):
+ # print(f'=========round {round+1}==============')
+ question = q_i_pair[0]
+ if len(q_i_pair) > 1:
+                    # image paths are separated by spaces
+ image_paths = q_i_pair[1].split(' ')
+ tmp_images = []
+ for image_path in image_paths:
+ image = Image.open(image_path.strip()).convert('RGB')
+ tmp_image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0].unsqueeze(0).cuda().half()
+ tmp_images.append(tmp_image_tensor)
+ images = images + tmp_images # get all images
+ with_image = True
+ image_num = len(tmp_images)
+ else:
+ image_num = 0
+ with_image = False
+
+ if len(images) > 0:
+ image_tensor = torch.cat(images, dim=0) # cat all images
+ else:
+ raise ValueError("No image provided. Did not fix this in the modeling side yet.")
+
+ full_prompt = TEMPLATE(question, with_image=with_image, first_message=(round==0), num_images=image_num)
+                full_prompt_ids = tokenizer(full_prompt).input_ids # tokenize the full prompt
+ if with_image:
+ image_number = len(images)
+ index = full_prompt_ids.index(image_token_dict[DST.DEFAULT_HUMAN_IMAGE_PRETOKEN])
+ full_prompt_ids[index] = image_token_dict[DST.image_mapping_dict[str(image_number)]]
+ full_prompt_ids = DST.flatten(full_prompt_ids)
+ input_ids = torch.as_tensor([system_instruct + full_prompt_ids]).cuda() # entire input as system instruction for simplicity
+ print ('\n',round,question, '||', q_i_pair[-1] )
+
+ generate_output = model.generate(image_tensor, input_ids,
+ generation_length=256)
+ # generation_kwargs={ 'num_beams':2,'num_return_sequences':1,'top_p':1,'do_sample':True, 'temperature':1}
+ print('vanilla-->', generate_output[1])
+ get_results[name].append([q_i_pair[-1], question, generate_output[1]])
+ extend_ids = generate_output[0].cpu().tolist()[0]
+ while extend_ids[-1] == tokenizer.pad_token_id:
+ extend_ids.pop()
+ while extend_ids[0] == tokenizer.bos_token_id:
+ # llama-2 generates bos token at the beginning
+ extend_ids.pop(0)
+ system_instruct = system_instruct + full_prompt_ids + extend_ids # entire input as system instruction for simplicity
+ system_instruct = system_instruct + [tokenizer.eos_token_id] # add eos token
+
+ with open(f'{args.output_filename}.csv', mode='w', newline='', encoding='utf-8') as file:
+ writer = csv.writer(file)
+ writer.writerow(['test_name', 'image_path', 'question', 'answer'])
+ for test_name, questions in get_results.items():
+ for question in questions:
+ writer.writerow([test_name] + question)
+
+
+
+
+if __name__ == "__main__":
+ main()
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_data/eval_comprehensive.json b/applications/DeepSpeed-VisualChat/eval/eval_data/eval_comprehensive.json
new file mode 100644
index 000000000..a4d367b9b
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/eval/eval_data/eval_comprehensive.json
@@ -0,0 +1,89 @@
+{
+ "cat_images1": [
+ ["Please describe the image in detail.", "./eval/eval_data/images/cats/1806905748_adb926a0a0.jpg"],
+ ["Please describe the image in detail.", "./eval/eval_data/images/cats/british_shorthair.jpg"],
+ ["Please describe the image in detail.", "./eval/eval_data/images/cats/cat.png"],
+ ["Are the colors of the three cats the same?"],
+ ["What are the differences between the first and third images?"],
+ ["What are the differences between the second and third images?"],
+ ["Is the cat in the first image in the sunshine?"]
+ ],
+ "cat_images2": [
+ ["Please describe the image in detail.", "./eval/eval_data/images/cats/1806905748_adb926a0a0.jpg"],
+ ["Please describe the image in detail.", "./eval/eval_data/images/cats/british_shorthair.jpg"],
+ ["What are the differences between the two images?"],
+ ["Please describe the image in detail.", "./eval/eval_data/images/cats/cat.png"],
+ ["Are the colors of the three cats the same?"],
+ ["What are the differences between the first and third images?"],
+ ["What are the differences between the second and third images?"],
+ ["Is the cat in the first image in the sunshine?"],
+ ["Which cat do you prefer and why?"],
+ ["I prefer the second cat. It's so cute."],
+ ["Then why do you prefer the third cat more?"]
+ ],
+ "counting_people1": [
+ ["Count the number of people in the image.", "./eval/eval_data/images/friends/can-count1.jpg"],
+ ["Count the number of people in the image.", "./eval/eval_data/images/friends/can-count2.jpg"],
+ ["What are the differences between the two images? Are they the same group of people? Explain why."],
+ ["Are you familiar with this TV series? Can you name the characters shown in the provided images? Who are they?"]
+ ],
+ "counting_people2":[
+ ["How many individuals are depicted in the image?", "./eval/eval_data/images/friends/can-count1.jpg"],
+ ["How many individuals can you see in the second image?", "./eval/eval_data/images/friends/can-count2.jpg"],
+ ["Can you spot any differences between these two images? Do they represent the same set of people? Please provide a rationale."],
+ ["Do you recognize this TV show? Can you name the characters shown in the provided images? Who are they?"]
+ ],
+ "counting_people3": [
+ ["Count the number of people in the image.", "./eval/eval_data/images/friends/wrong-count1.jpg"],
+ ["Count the number of people in the image.", "./eval/eval_data/images/friends/wrong-count2.jpg"],
+ ["What are the differences between the two images? Are they the same group of people? Explain why."]
+ ],
+ "counting_people4": [
+ ["How many individuals are depicted in the image?", "./eval/eval_data/images/friends/wrong-count1.jpg"],
+ ["How many individuals are depicted in the image?", "./eval/eval_data/images/friends/wrong-count2.jpg"],
+ ["Can you spot any differences between these two images? Do they represent the same set of people? Please provide a rationale."],
+ ["Do you recognize this TV show? Can you name the characters shown in the provided images? Who are they?"]
+ ],
+ "zootopia_adventures1": [
+ ["Please describe the image in detail.", "./eval/eval_data/images/zootopia/z1.png"],
+ ["Please describe the image in detail.", "./eval/eval_data/images/zootopia/z2.png"],
+ ["Can you name the characters in the images? Who are they? What are they doing?", "./eval/eval_data/images/zootopia/z3.png"],
+ ["You are an imaginative storyteller. Create a fascinating story based on the first, second and third image."],
+ ["Are you familiar with these characters? What movie are they from?"],
+ ["Can you name the characters in the images? Who are they?"],
+ ["In what type of environment or setting do these characters live? Describe it."]
+ ],
+ "zootopia_adventures2": [
+ ["Create an engaging story strictly based on the images.", "./eval/eval_data/images/zootopia/z1.png ./eval/eval_data/images/zootopia/z2.png ./eval/eval_data/images/zootopia/z3.png"],
+ ["Do you recognize the setting or the characters in these images? Name the movie."],
+ ["Can you share some interesting facts or details about the characters shown in the images?"],
+ ["Which character do you find the most intriguing and why?"],
+ ["Based on the images, can you create some dialogues that the characters might say to each other in these situations?"]
+ ],
+ "zootopia_adventures3": [
+ ["Examine and describe the characters' actions in the first image.", "./eval/eval_data/images/zootopia/z1.png"],
+ ["In the second image, what are the main characters doing, and how do they seem to feel?", "./eval/eval_data/images/zootopia/z2.png"],
+ ["Contrast the characters' moods and interactions in the two provided images."],
+ ["Imagine and narrate a hilarious situation involving the characters from the images.", "./eval/eval_data/images/zootopia/z3.png"],
+ ["Name the movie from which these characters are, and give a succinct summary of its plot."],
+ ["Create a funny and unexpected scenario that could unfold between the characters in these images."]
+ ],
+ "tech_ceos1": [
+ ["Who is this person in this first image?", "./eval/eval_data/images/tech-ceo/jobs1.jpg"],
+ ["Who is this person in this second image?", "./eval/eval_data/images/tech-ceo/gate1.jpg"],
+ ["Who is this person in this third image?", "./eval/eval_data/images/tech-ceo/musk1.jpg"],
+ ["Recall who is in the second image."],
+ ["Recall who is in the first image."],
+ ["Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple."],
+ ["Is the person in the third image the founder of Apple?"]
+ ],
+ "tech_ceos2": [
+ ["Who is this person in the first image?", "./eval/eval_data/images/tech-ceo/musk1.jpg"],
+ ["Who is this person in the second image?", "./eval/eval_data/images/tech-ceo/gate1.jpg"],
+ ["Who is this person in the third image?", "./eval/eval_data/images/tech-ceo/jobs1.jpg"],
+ ["Recall who is in the second image."],
+ ["Recall who is in the first image."],
+ ["Is the person in the first image the founder of Apple? If not, which of the above images is the person the founder of Apple. "],
+ ["Is the person in the third image the founder of Apple?"]
+ ]
+}
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_data/eval_robustness.json b/applications/DeepSpeed-VisualChat/eval/eval_data/eval_robustness.json
new file mode 100644
index 000000000..16747af32
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/eval/eval_data/eval_robustness.json
@@ -0,0 +1,78 @@
+{
+ "tech_ceos2.1a": [
+ ["Who is this person in the image?", "./eval/eval_data/images/tech-ceo/jobs1.jpg"],
+ ["Who is this person in the image?", "./eval/eval_data/images/tech-ceo/gate1.jpg"],
+ ["Who is this person in the image?", "./eval/eval_data/images/tech-ceo/musk1.jpg"],
+ ["Recall who is in the second image."],
+ ["Recall who is in the first image."],
+ ["Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple."],
+ ["Is the person in the third image the founder of Apple?"]
+ ],
+ "tech_ceos2.1b": [
+ ["Who is this person in this image?", "./eval/eval_data/images/tech-ceo/jobs1.jpg"],
+ ["Who is this person in this image?", "./eval/eval_data/images/tech-ceo/gate1.jpg"],
+ ["Who is this person in this image?", "./eval/eval_data/images/tech-ceo/musk1.jpg"],
+ ["Recall who is in the second image."],
+ ["Recall who is in the first image."],
+ ["Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple."],
+ ["Is the person in the third image the founder of Apple?"]
+ ],
+ "tech_ceos2.1c": [
+ ["Who is this person in this image a?", "./eval/eval_data/images/tech-ceo/jobs1.jpg"],
+ ["Who is this person in this image b?", "./eval/eval_data/images/tech-ceo/gate1.jpg"],
+ ["Who is this person in this image c?", "./eval/eval_data/images/tech-ceo/musk1.jpg"],
+ ["Recall who is in image b."],
+ ["Recall who is in the image a."],
+ ["Is the person in the image a the founder of Apple? if not, which of the above images is the person the founder of Apple."],
+ ["Is the person in the image c the founder of Apple?"]
+ ],
+ "tech_ceos2.1d": [
+ ["Who is this person in this first image?", "./eval/eval_data/images/tech-ceo/jobs1.jpg"],
+ ["Who is this person in this second image?", "./eval/eval_data/images/tech-ceo/gate1.jpg"],
+ ["Who is this person in this third image?", "./eval/eval_data/images/tech-ceo/musk1.jpg"],
+ ["Recall who is in the second image."],
+ ["Recall who is in the first image."],
+ ["Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple."],
+ ["Is the person in the third image the founder of Apple?"]
+ ],
+ "tech_ceos2.1aa": [
+ ["Who is this person in the image?", "./eval/eval_data/images/tech-ceo/jobs1.jpg"],
+ ["Who is this person in the image?", "./eval/eval_data/images/tech-ceo/gate1.jpg"],
+ ["What's the differnce between the first and second image"],
+ ["Who is this person in the image?", "./eval/eval_data/images/tech-ceo/musk1.jpg"],
+ ["Recall who is in the second image."],
+ ["Recall who is in the first image."],
+ ["Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple."],
+ ["Is the person in the third image the founder of Apple?"]
+ ],
+ "tech_ceos2.1bb": [
+ ["Who is this person in this image?", "./eval/eval_data/images/tech-ceo/jobs1.jpg"],
+ ["Who is this person in this image?", "./eval/eval_data/images/tech-ceo/gate1.jpg"],
+ ["What's the differnce between the first and second images"],
+ ["Who is this person in this image?", "./eval/eval_data/images/tech-ceo/musk1.jpg"],
+ ["Recall who is in the second image."],
+ ["Recall who is in the first image."],
+ ["Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple."],
+ ["Is the person in the third image the founder of Apple?"]
+ ],
+ "tech_ceos2.1cc": [
+ ["Who is this person in this image a?", "./eval/eval_data/images/tech-ceo/jobs1.jpg"],
+ ["Who is this person in this image b?", "./eval/eval_data/images/tech-ceo/gate1.jpg"],
+ ["What's the differnce between the image a and image b"],
+ ["Who is this person in this image c?", "./eval/eval_data/images/tech-ceo/musk1.jpg"],
+ ["Recall who is in image b."],
+ ["Recall who is in the image a."],
+ ["Is the person in the image a the founder of Apple? if not, which of the above images is the person the founder of Apple."],
+ ["Is the person in the image c the founder of Apple?"]
+ ],
+ "tech_ceos2.1dd": [
+ ["Who is this person in this first image?", "./eval/eval_data/images/tech-ceo/jobs1.jpg"],
+ ["Who is this person in this second image?", "./eval/eval_data/images/tech-ceo/gate1.jpg"],
+ ["What's the differnce between the first and second images"],
+ ["Who is this person in this third image?", "./eval/eval_data/images/tech-ceo/musk1.jpg"],
+ ["Recall who is in the second image."],
+ ["Recall who is in the first image."],
+ ["Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple."],
+ ["Is the person in the third image the founder of Apple?"]
+ ]
+}
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_data/eval_single.json b/applications/DeepSpeed-VisualChat/eval/eval_data/eval_single.json
new file mode 100644
index 000000000..42a7ad95e
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/eval/eval_data/eval_single.json
@@ -0,0 +1,11 @@
+{
+"cat_images1": [["please describe the image", "./eval/eval_data/images/cats/cat.png"]],
+"cat_images2": [["can you describe the image", "./eval/eval_data/images/cats/1806905748_adb926a0a0.jpg"]],
+"cat_images3": [["please describe the image", "./eval/eval_data/images/cats/british_shorthair.jpg"]],
+"extreme_ironing": [["What is unusual about this image?", "./eval/eval_data/images/singles/extreme_ironing.jpg"]],
+"waterview": [["What are the things I should be cautious about when I visit here?", "./eval/eval_data/images/singles/waterview.jpg"]],
+"art-dog": [["can you describe the image", "./eval/eval_data/images/singles/202160027_b319c4166e.jpg"]],
+"funny-phone": [["What is funny about this image? Describe it panel by panel.", "./eval/eval_data/images/singles/1.jpg"]],
+"squirrel": [["Why would a person find this image funny?", "./eval/eval_data/images/singles/2.jpg"]],
+"art-painting": [["Tell me about this work of art.", "./eval/eval_data/images/singles/50.jpg"]]
+}
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_data/images/cats/1806905748_adb926a0a0.jpg b/applications/DeepSpeed-VisualChat/eval/eval_data/images/cats/1806905748_adb926a0a0.jpg
new file mode 100644
index 000000000..100eccc42
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/eval/eval_data/images/cats/1806905748_adb926a0a0.jpg differ
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_data/images/cats/british_shorthair.jpg b/applications/DeepSpeed-VisualChat/eval/eval_data/images/cats/british_shorthair.jpg
new file mode 100644
index 000000000..b61731c62
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/eval/eval_data/images/cats/british_shorthair.jpg differ
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_data/images/cats/cat.png b/applications/DeepSpeed-VisualChat/eval/eval_data/images/cats/cat.png
new file mode 100644
index 000000000..1a48d45c4
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/eval/eval_data/images/cats/cat.png differ
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_data/images/friends/can-count1.jpg b/applications/DeepSpeed-VisualChat/eval/eval_data/images/friends/can-count1.jpg
new file mode 100644
index 000000000..b29d3a97d
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/eval/eval_data/images/friends/can-count1.jpg differ
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_data/images/friends/can-count2.jpg b/applications/DeepSpeed-VisualChat/eval/eval_data/images/friends/can-count2.jpg
new file mode 100644
index 000000000..b09d1694a
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/eval/eval_data/images/friends/can-count2.jpg differ
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_data/images/friends/wrong-count1.jpg b/applications/DeepSpeed-VisualChat/eval/eval_data/images/friends/wrong-count1.jpg
new file mode 100644
index 000000000..2d4b1b958
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/eval/eval_data/images/friends/wrong-count1.jpg differ
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_data/images/friends/wrong-count2.jpg b/applications/DeepSpeed-VisualChat/eval/eval_data/images/friends/wrong-count2.jpg
new file mode 100644
index 000000000..08ac55fe2
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/eval/eval_data/images/friends/wrong-count2.jpg differ
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_data/images/singles/1.jpg b/applications/DeepSpeed-VisualChat/eval/eval_data/images/singles/1.jpg
new file mode 100644
index 000000000..69984e57b
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/eval/eval_data/images/singles/1.jpg differ
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_data/images/singles/2.jpg b/applications/DeepSpeed-VisualChat/eval/eval_data/images/singles/2.jpg
new file mode 100644
index 000000000..ca1232162
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/eval/eval_data/images/singles/2.jpg differ
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_data/images/singles/202160027_b319c4166e.jpg b/applications/DeepSpeed-VisualChat/eval/eval_data/images/singles/202160027_b319c4166e.jpg
new file mode 100644
index 000000000..8628f3d7b
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/eval/eval_data/images/singles/202160027_b319c4166e.jpg differ
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_data/images/singles/50.jpg b/applications/DeepSpeed-VisualChat/eval/eval_data/images/singles/50.jpg
new file mode 100644
index 000000000..f23f0548d
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/eval/eval_data/images/singles/50.jpg differ
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_data/images/singles/extreme_ironing.jpg b/applications/DeepSpeed-VisualChat/eval/eval_data/images/singles/extreme_ironing.jpg
new file mode 100644
index 000000000..638b07883
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/eval/eval_data/images/singles/extreme_ironing.jpg differ
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_data/images/singles/waterview.jpg b/applications/DeepSpeed-VisualChat/eval/eval_data/images/singles/waterview.jpg
new file mode 100644
index 000000000..6f44ebaba
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/eval/eval_data/images/singles/waterview.jpg differ
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_data/images/tech-ceo/gate1.jpg b/applications/DeepSpeed-VisualChat/eval/eval_data/images/tech-ceo/gate1.jpg
new file mode 100644
index 000000000..b7b747294
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/eval/eval_data/images/tech-ceo/gate1.jpg differ
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_data/images/tech-ceo/jobs1.jpg b/applications/DeepSpeed-VisualChat/eval/eval_data/images/tech-ceo/jobs1.jpg
new file mode 100644
index 000000000..18e8d35e9
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/eval/eval_data/images/tech-ceo/jobs1.jpg differ
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_data/images/tech-ceo/musk1.jpg b/applications/DeepSpeed-VisualChat/eval/eval_data/images/tech-ceo/musk1.jpg
new file mode 100644
index 000000000..7f2abfe89
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/eval/eval_data/images/tech-ceo/musk1.jpg differ
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_data/images/zootopia/z1.png b/applications/DeepSpeed-VisualChat/eval/eval_data/images/zootopia/z1.png
new file mode 100644
index 000000000..fdb9d8db9
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/eval/eval_data/images/zootopia/z1.png differ
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_data/images/zootopia/z2.png b/applications/DeepSpeed-VisualChat/eval/eval_data/images/zootopia/z2.png
new file mode 100644
index 000000000..57766b181
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/eval/eval_data/images/zootopia/z2.png differ
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_data/images/zootopia/z2a.png b/applications/DeepSpeed-VisualChat/eval/eval_data/images/zootopia/z2a.png
new file mode 100644
index 000000000..79f30e02d
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/eval/eval_data/images/zootopia/z2a.png differ
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_data/images/zootopia/z3.png b/applications/DeepSpeed-VisualChat/eval/eval_data/images/zootopia/z3.png
new file mode 100644
index 000000000..8ccd23f53
Binary files /dev/null and b/applications/DeepSpeed-VisualChat/eval/eval_data/images/zootopia/z3.png differ
diff --git a/applications/DeepSpeed-VisualChat/eval/eval_scripts/run_batch.sh b/applications/DeepSpeed-VisualChat/eval/eval_scripts/run_batch.sh
new file mode 100644
index 000000000..dd79b0a00
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/eval/eval_scripts/run_batch.sh
@@ -0,0 +1,23 @@
+#!/bin/bash
+# Copyright (c) Microsoft Corporation.
+# SPDX-License-Identifier: Apache-2.0
+# DeepSpeed Team
+
+# EVAL_DATSET: one of eval_single, eval_comprehensive, eval_robustness (see the JSON files in ./eval_data/)
+MAIN_PATH=$1
+VISION_MODEL=/blob/transformers_cache/qwen-clip #openai/clip-vit-large-patch14
+LLM=/blob/transformers_cache/Llama-2-13b-hf #meta-llama/Llama-2-7b
+for EVAL_DATSET in eval_single eval_comprehensive eval_robustness
+do
+ SAVE_PATH=eval/results/${EVAL_DATSET}
+    mkdir -p ${SAVE_PATH}
+ for CKPT_NAME in final best_eval
+ do
+        # NOTE: to run multi-GPU, simply set "export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7;"
+ export CUDA_VISIBLE_DEVICES=0; python eval/batch_generation.py --model_name dsvl --vis_proj baseline --max_seq_len 4096 \
+ --lm_model_name_or_path ${LLM} --vision_model_name_or_path ${VISION_MODEL} \
+ --checkpoint_path $MAIN_PATH --checkpoint_names $CKPT_NAME --eval_data ${EVAL_DATSET} \
+ --enable_mmca_attention --output_filename ${SAVE_PATH}/ours_${CKPT_NAME} &> ${SAVE_PATH}/ours_${CKPT_NAME}.log
+ done
+done
+
diff --git a/applications/DeepSpeed-VisualChat/eval/results/eval_comprehensive/ours-set1_best_eval.csv b/applications/DeepSpeed-VisualChat/eval/results/eval_comprehensive/ours-set1_best_eval.csv
new file mode 100644
index 000000000..dd83ebf25
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/eval/results/eval_comprehensive/ours-set1_best_eval.csv
@@ -0,0 +1,78 @@
+test_name,image_path,question,answer
+cat_images1,./eval/eval_data/images/cats/1806905748_adb926a0a0.jpg,Please describe the image in detail.,"The image shows a black cat lying on its back on a carpeted floor. The cat has its paws up in the air and is looking directly at the camera. The cat's eyes are yellow, and it appears to be relaxed and comfortable in its position. The background of the image is not particularly distinct, but the focus is on the cat and its pose."
+cat_images1,./eval/eval_data/images/cats/british_shorthair.jpg,Please describe the image in detail.,"The image shows a gray and white cat lying on a couch, with its paws stretched out and its head resting on the couch. The cat appears to be relaxed and comfortable in its position."
+cat_images1,./eval/eval_data/images/cats/cat.png,Please describe the image in detail.,"The image shows a brown and white cat standing on a green leafy plant, with its paws up in the air and its head tilted to the side. The cat appears to be curious and alert, as it looks around its surroundings. The background of the image is not particularly distinct, but the focus is on the cat and its pose."
+cat_images1,Are the colors of the three cats the same?,Are the colors of the three cats the same?,"No, the colors of the three cats are different. The first cat is black, the second cat is gray and white, and the third cat is brown and white."
+cat_images1,What are the differences between the first and third images?,What are the differences between the first and third images?,The first image shows a black cat lying on a carpet while the third image shows a brown and white cat standing on a green leafy plant.
+cat_images1,What are the differences between the second and third images?,What are the differences between the second and third images?,The second image shows a gray and white cat lying on a couch while the third image shows a brown and white cat standing on a green leafy plant.
+cat_images1,Is the cat in the first image in the sunshine?,Is the cat in the first image in the sunshine?,"Yes, the cat in the first image is lying in the sunshine."
+cat_images2,./eval/eval_data/images/cats/1806905748_adb926a0a0.jpg,Please describe the image in detail.,"The image shows a black cat lying on its back on a carpeted floor. The cat has its paws up in the air and is looking directly at the camera. The cat's eyes are yellow, and it appears to be relaxed and comfortable in its position. The background of the image is not particularly distinct, but the focus is on the cat and its pose."
+cat_images2,./eval/eval_data/images/cats/british_shorthair.jpg,Please describe the image in detail.,"The image shows a gray and white cat lying on a couch, with its paws stretched out and its head resting on the couch. The cat appears to be relaxed and comfortable in its position."
+cat_images2,What are the differences between the two images?,What are the differences between the two images?,The first image shows a black cat lying on a carpet while the second image shows a gray and white cat lying on a couch.
+cat_images2,./eval/eval_data/images/cats/cat.png,Please describe the image in detail.,"The image shows a cat lying on a carpeted floor, with its paws stretched out and its head resting on the floor. The cat appears to be relaxed and comfortable in its position."
+cat_images2,Are the colors of the three cats the same?,Are the colors of the three cats the same?,"No, the colors of the three cats are different. The first cat is black, the second cat is gray and white, and the third cat is brown and white."
+cat_images2,What are the differences between the first and third images?,What are the differences between the first and third images?,The first image shows a black cat lying on a carpet while the third image shows a brown and white cat lying on a couch.
+cat_images2,What are the differences between the second and third images?,What are the differences between the second and third images?,The second image shows a gray and white cat lying on a couch while the third image shows a brown and white cat lying on a couch.
+cat_images2,Is the cat in the first image in the sunshine?,Is the cat in the first image in the sunshine?,"No, the cat in the first image is not in the sunshine. It is lying on a carpeted floor."
+cat_images2,Which cat do you prefer and why?,Which cat do you prefer and why?,"I prefer the gray and white cat in the second image because it looks calm and relaxed, lying on a couch with its paws stretched out. The cat's expression seems to convey a sense of contentment and comfort, which is appealing to me."
+cat_images2,I prefer the second cat. It's so cute.,I prefer the second cat. It's so cute.,"I agree, the second cat is very cute and looks comfortable lying on the couch."
+cat_images2,Then why do you prefer the third cat more?,Then why do you prefer the third cat more?,"I prefer the third cat because it is brown and white, which is a classic and timeless color combination. Additionally, the third cat is lying on a couch, which is a more comfortable and inviting setting than the carpet in the first image."
+counting_people1,./eval/eval_data/images/friends/can-count1.jpg,Count the number of people in the image.,"There are six people in the image, including the two women and four men."
+counting_people1,./eval/eval_data/images/friends/can-count2.jpg,Count the number of people in the image.,"The number of people in the image is not specified, but there are at least six people in the group."
+counting_people1,What are the differences between the two images? Are they the same group of people? Explain why.,What are the differences between the two images? Are they the same group of people? Explain why.,"The first image has two women and four men, while the second image has four women and two men. The people in the two images are not the same, and the number of men and women is different. This is because the images are taken at different times and with different groups of people."
+counting_people1,Are you familiar with this TV series? Can you name the characters shown in the provided images? Who are they?,Are you familiar with this TV series? Can you name the characters shown in the provided images? Who are they?,"Yes, I am familiar with this TV series. The characters in the images are from the TV series ""Friends."" The characters are: Rachel, Monica, Phoebe, Joey, Chandler, and Ross."
+counting_people2,./eval/eval_data/images/friends/can-count1.jpg,How many individuals are depicted in the image?,"There are five individuals in the image, including the two women and three men."
+counting_people2,./eval/eval_data/images/friends/can-count2.jpg,How many individuals can you see in the second image?,There are six individuals in the second image.
+counting_people2,Can you spot any differences between these two images? Do they represent the same set of people? Please provide a rationale.,Can you spot any differences between these two images? Do they represent the same set of people? Please provide a rationale.,"The first image is a portrait of two women and three men, while the second image is a group photo of six people. The two images do not represent the same set of people."
+counting_people2,Do you recognize this TV show? Can you name the characters shown in the provided images? Who are they?,Do you recognize this TV show? Can you name the characters shown in the provided images? Who are they?,"Yes, you can recognize the TV show from the images. The characters are the cast of the TV show ""Friends."""
+counting_people3,./eval/eval_data/images/friends/wrong-count1.jpg,Count the number of people in the image.,There are four people in the image.
+counting_people3,./eval/eval_data/images/friends/wrong-count2.jpg,Count the number of people in the image.,There are six people in the image.
+counting_people3,What are the differences between the two images? Are they the same group of people? Explain why.,What are the differences between the two images? Are they the same group of people? Explain why.,The first image has four people sitting on a bench while the second image has six people standing in a group. The people in the two images are not the same. The first image has a fountain in the background while the second image has trees in the background. The first image has a dog in it while the second image does not have any animals.
+counting_people4,./eval/eval_data/images/friends/wrong-count1.jpg,How many individuals are depicted in the image?,There are four individuals depicted in the image.
+counting_people4,./eval/eval_data/images/friends/wrong-count2.jpg,How many individuals are depicted in the image?,There are six individuals in the image.
+counting_people4,Can you spot any differences between these two images? Do they represent the same set of people? Please provide a rationale.,Can you spot any differences between these two images? Do they represent the same set of people? Please provide a rationale.,"Yes, the two images have different people in them. The first image has four people sitting on a bench, while the second image has six people standing in a group."
+counting_people4,Do you recognize this TV show? Can you name the characters shown in the provided images? Who are they?,Do you recognize this TV show? Can you name the characters shown in the provided images? Who are they?,"Yes, I recognize the TV show. The characters are John, Paul, George, and Ringo."
+zootopia_adventures1,./eval/eval_data/images/zootopia/z1.png,Please describe the image in detail.,"The image is a digital recreation of a scene featuring two stuffed animals, a dog and a cat, sitting at a desk with a phone in front of them. The desk has a sign that says ""Next Window."" There are also two people in the background, one on the left and one on the right."
+zootopia_adventures1,./eval/eval_data/images/zootopia/z2.png,Please describe the image in detail.,"The image is a digital recreation of a scene featuring a group of stuffed animals, including a fox, a raccoon, and a squirrel, standing in front of a phone booth. There are also two people in the background, one on the left and one on the right."
+zootopia_adventures1,./eval/eval_data/images/zootopia/z3.png,Can you name the characters in the images? Who are they? What are they doing?,"The characters in the images are a fox, a raccoon, and a squirrel. They are standing in front of a phone booth and a coffee shop."
+zootopia_adventures1,"You are an imaginative storyteller. Create a fascinating story based on the first, second and third image.","You are an imaginative storyteller. Create a fascinating story based on the first, second and third image.","Sure, you can create a story about the characters in the images. For example, you can imagine that the fox, raccoon, and squirrel are friends who are on a mission to save the world from an evil villain. They are using the phone booth and coffee shop as a base to plan their strategy and gather resources."
+zootopia_adventures1,Are you familiar with these characters? What movie are they from?,Are you familiar with these characters? What movie are they from?,"Yes, I am familiar with these characters. They are from the movie ""Zootopia."""
+zootopia_adventures1,Can you name the characters in the images? Who are they?,Can you name the characters in the images? Who are they?,"Yes, the characters in the images are a fox, a raccoon, and a squirrel."
+zootopia_adventures1,In what type of environment or setting do these characters live? Describe it.,In what type of environment or setting do these characters live? Describe it.,The characters in the images live in a phone booth and a coffee shop.
+zootopia_adventures2,./eval/eval_data/images/zootopia/z1.png ./eval/eval_data/images/zootopia/z2.png ./eval/eval_data/images/zootopia/z3.png,Create an engaging story strictly based on the images.,"The engaging story could be about a group of stuffed animals who are waiting in line at a bank or a store. They are discussing the next step in their transaction and are eager to get their business done. The story could also involve a funny incident that happens while they are in line, such as one of the stuffed animals losing their head or getting into a fight with another animal. The story could be told from the perspective of one of the stuffed animals or from an observer's point of view."
+zootopia_adventures2,Do you recognize the setting or the characters in these images? Name the movie.,Do you recognize the setting or the characters in these images? Name the movie.,"The setting is a bank or a store, and the characters are stuffed animals. The movie is not specified."
+zootopia_adventures2,Can you share some interesting facts or details about the characters shown in the images?,Can you share some interesting facts or details about the characters shown in the images?,"The characters shown in the images are stuffed animals from the movie ""Toy Story."" The main characters are Woody, a cowboy, and Buzz Lightyear, a space ranger. Other characters include Mr. Potato Head, Rex the dinosaur, and Slinky the dog."
+zootopia_adventures2,Which character do you find the most intriguing and why?,Which character do you find the most intriguing and why?,"I find Mr. Potato Head to be the most intriguing character because he can be assembled in different ways using various parts, making him versatile and unique."
+zootopia_adventures2,"Based on the images, can you create some dialogues that the characters might say to each other in these situations?","Based on the images, can you create some dialogues that the characters might say to each other in these situations?","Sure, here are some possible dialogues:
+- ""Hey, let's get in line for the next window.""
+- ""I'm not sure what we're doing here, but let's follow the others.""
+- ""I hope we can get our business done quickly so we can go back to playing.""
+- ""I'm not sure if this is the right place, but it looks like everyone else is here.""
+- ""I'm not sure what's going on, but let's just go with the flow.""
+- ""I think we should ask someone for help to figure out what we need to do.""
+- ""I'm not sure if we're in the right place, but let's just try to blend in and see what happens.""
+- ""I'm not sure if this is the right line, but it seems like everyone else is in it.""
+- ""I'm not sure what's going on, but let's just follow the others and see where it leads.""
+- ""I'm not sure if this is the right place, but it looks like everyone else is here, so let's just go with it."""
+zootopia_adventures3,./eval/eval_data/images/zootopia/z1.png,Examine and describe the characters' actions in the first image.,"In the first image, the two stuffed animals are sitting at a desk with a sign that says ""Next Window."" One of the stuffed animals is holding a coffee cup, while the other is holding a phone. They appear to be waiting for something at the window."
+zootopia_adventures3,./eval/eval_data/images/zootopia/z2.png,"In the second image, what are the main characters doing, and how do they seem to feel?","In the second image, the main characters are a fox and a stuffed animal. They are standing in front of a sign that says ""Next Window."" The fox seems to be holding a phone, while the stuffed animal is holding a coffee cup. They appear to be interacting with each other and seem to be happy or at least not showing any negative emotions."
+zootopia_adventures3,Contrast the characters' moods and interactions in the two provided images.,Contrast the characters' moods and interactions in the two provided images.,"In the first image, the two stuffed animals are sitting at a desk with a sign that says ""Next Window."" They appear to be waiting for something, but their moods cannot be determined.
+
+In the second image, the fox and the stuffed animal are standing in front of a sign that says ""Next Window."" The fox is holding a phone and the stuffed animal is holding a coffee cup. They seem to be interacting with each other and appear to be happy or at least not showing any negative emotions."
+zootopia_adventures3,./eval/eval_data/images/zootopia/z3.png,Imagine and narrate a hilarious situation involving the characters from the images.,"In the hilarious situation, the fox and the stuffed animal are having a conversation in front of a sign that says ""Next Window."" The fox is holding a phone and the stuffed animal is holding a coffee cup. Suddenly, the fox gets a call and starts talking on the phone, completely ignoring the stuffed animal. The stuffed animal gets frustrated and starts yelling at the fox, demanding his attention. The fox, still focused on the phone, pays no attention to the stuffed animal's yelling. The stuffed animal gets even more upset and starts throwing the coffee cup at the fox, but the cup misses and hits the wall behind them. The fox, oblivious to the situation, continues talking on the phone, while the stuffed animal storms off in a huff."
+zootopia_adventures3,"Name the movie from which these characters are, and give a succinct summary of its plot.","Name the movie from which these characters are, and give a succinct summary of its plot.","The movie is ""Next Window,"" and the plot involves a fox and a stuffed animal having a conversation in front of a sign that says ""Next Window."" The fox is holding a phone and the stuffed animal is holding a coffee cup. The fox gets a call on the phone and starts talking, ignoring the stuffed animal. The stuffed animal gets frustrated and starts yelling at the fox, demanding his attention. The fox continues talking on the phone, and the stuffed animal throws a coffee cup at the fox, but misses and hits the wall behind them. The fox pays no attention to the stuffed animal's outburst and continues talking on the phone."
+zootopia_adventures3,Create a funny and unexpected scenario that could unfold between the characters in these images.,Create a funny and unexpected scenario that could unfold between the characters in these images.,"In this funny and unexpected scenario, the fox and the stuffed animal are having a conversation in front of a sign that says ""Next Window."" The fox is holding a phone and the stuffed animal is holding a coffee cup. Suddenly, the fox gets a call on the phone and starts talking, ignoring the stuffed animal. The stuffed animal gets frustrated and starts yelling at the fox, demanding his attention. The fox continues talking on the phone, completely oblivious to the stuffed animal's outburst. The stuffed animal gets even more upset and throws the coffee cup at the fox, but misses and hits the wall behind them. The fox, still focused on the phone call, pays no attention to the coffee cup hitting the wall and continues talking. The stuffed animal, now furious, starts jumping up and down, trying to get the fox's attention. The fox, still engrossed in the phone call, ignores the stuffed animal's antics and continues talking. The stuffed animal, in a last-ditch effort to get the fox's attention, starts singing a song, but the fox"
+tech_ceos1,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this first image?,"The person in the first image is Steve Jobs, the CEO of Apple."
+tech_ceos1,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this second image?,"This person is Bill Gates, the co-founder of Microsoft."
+tech_ceos1,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this third image?,"This person is a man wearing a suit and tie, and he is holding a microphone."
+tech_ceos1,Recall who is in the second image.,Recall who is in the second image.,The second image shows a man holding a microphone in front of a large sign.
+tech_ceos1,Recall who is in the first image.,Recall who is in the first image.,The first image shows Steve Jobs standing in front of a podium.
+tech_ceos1,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Yes, the person in the first image is the founder of Apple."
+tech_ceos1,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"No, the person in the third image is not the founder of Apple."
+tech_ceos2,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in the first image?,"The person in the first image is a man wearing a white shirt and a black tie, who is holding a microphone in his hand."
+tech_ceos2,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in the second image?,"The person in the second image is Bill Gates, the founder of Microsoft."
+tech_ceos2,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in the third image?,"The person in the third image is Steve Jobs, the founder of Apple."
+tech_ceos2,Recall who is in the second image.,Recall who is in the second image.,There is no one in the second image.
+tech_ceos2,Recall who is in the first image.,Recall who is in the first image.,There is a man in the first image who is holding a microphone in his hand.
+tech_ceos2,"Is the person in the first image the founder of Apple? If not, which of the above images is the person the founder of Apple. ","Is the person in the first image the founder of Apple? If not, which of the above images is the person the founder of Apple. ","No, the person in the first image is not the founder of Apple. The person in the second image is the founder of Apple."
+tech_ceos2,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"Yes, the person in the third image is the founder of Apple, Steve Jobs."
diff --git a/applications/DeepSpeed-VisualChat/eval/results/eval_comprehensive/ours-set1_final.csv b/applications/DeepSpeed-VisualChat/eval/results/eval_comprehensive/ours-set1_final.csv
new file mode 100644
index 000000000..66c30e318
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/eval/results/eval_comprehensive/ours-set1_final.csv
@@ -0,0 +1,75 @@
+test_name,image_path,question,answer
+cat_images1,./eval/eval_data/images/cats/1806905748_adb926a0a0.jpg,Please describe the image in detail.,"The image shows a black cat lying on a carpeted floor, with its paws stretched out and its eyes closed. The cat appears to be relaxed and comfortable, possibly sleeping or resting. The background of the image is not particularly distinct, but the focus is on the cat and its position on the floor."
+cat_images1,./eval/eval_data/images/cats/british_shorthair.jpg,Please describe the image in detail.,"The image shows a gray and white cat lying on a couch, with its paws stretched out and its eyes closed. The cat appears to be relaxed and comfortable on the couch."
+cat_images1,./eval/eval_data/images/cats/cat.png,Please describe the image in detail.,"The image shows a brown and white cat lying on a green leafy surface, possibly a grassy area. The cat is stretched out and appears to be relaxed, with its paws extended and its eyes closed. The background of the image is not particularly distinct, but it appears to be outdoors, surrounded by greenery."
+cat_images1,Are the colors of the three cats the same?,Are the colors of the three cats the same?,"No, the colors of the three cats are different. The first cat is black, the second cat is gray and white, and the third cat is brown and white."
+cat_images1,What are the differences between the first and third images?,What are the differences between the first and third images?,The first image shows a black cat lying on a carpet while the third image shows a brown and white cat lying on a green leafy surface.
+cat_images1,What are the differences between the second and third images?,What are the differences between the second and third images?,The second image shows a gray and white cat lying on a couch while the third image shows a brown and white cat lying on a green leafy surface.
+cat_images1,Is the cat in the first image in the sunshine?,Is the cat in the first image in the sunshine?,"Yes, the cat in the first image is lying in the sunshine."
+cat_images2,./eval/eval_data/images/cats/1806905748_adb926a0a0.jpg,Please describe the image in detail.,"The image shows a black cat lying on a carpeted floor, with its paws stretched out and its eyes closed. The cat appears to be relaxed and comfortable, possibly sleeping or resting. The background of the image is not particularly distinct, but the focus is on the cat and its position on the floor."
+cat_images2,./eval/eval_data/images/cats/british_shorthair.jpg,Please describe the image in detail.,"The image shows a gray and white cat lying on a couch, with its paws stretched out and its eyes closed. The cat appears to be relaxed and comfortable on the couch."
+cat_images2,What are the differences between the two images?,What are the differences between the two images?,The first image shows a black cat lying on a carpeted floor while the second image shows a gray and white cat lying on a couch.
+cat_images2,./eval/eval_data/images/cats/cat.png,Please describe the image in detail.,"The image shows a brown and white cat sitting on a carpeted floor, with its paws stretched out and its eyes closed. The cat appears to be relaxed and comfortable in its position."
+cat_images2,Are the colors of the three cats the same?,Are the colors of the three cats the same?,"No, the colors of the three cats are different. The first cat is black, the second cat is gray and white, and the third cat is brown and white."
+cat_images2,What are the differences between the first and third images?,What are the differences between the first and third images?,The first image shows a black cat lying on a carpeted floor while the third image shows a brown and white cat sitting on a carpeted floor.
+cat_images2,What are the differences between the second and third images?,What are the differences between the second and third images?,The second image shows a gray and white cat lying on a couch while the third image shows a brown and white cat sitting on a carpeted floor.
+cat_images2,Is the cat in the first image in the sunshine?,Is the cat in the first image in the sunshine?,"No, the cat in the first image is not in the sunshine. It is lying on a carpeted floor."
+cat_images2,Which cat do you prefer and why?,Which cat do you prefer and why?,"I prefer the gray and white cat in the second image because it looks calm and relaxed while lying on the couch. The cat's eyes are closed, and it seems to be enjoying the comfort of the couch. The image also shows the cat's paws stretched out, which adds to its relaxed appearance."
+cat_images2,I prefer the second cat. It's so cute.,I prefer the second cat. It's so cute.,"I agree, the second cat is very cute and looks relaxed while lying on the couch."
+cat_images2,Then why do you prefer the third cat more?,Then why do you prefer the third cat more?,"I prefer the third cat because it is brown and white, which is a classic and attractive color combination for cats. Additionally, the third cat is sitting on a carpeted floor, which provides a cozy and comfortable environment for the cat to relax in."
+counting_people1,./eval/eval_data/images/friends/can-count1.jpg,Count the number of people in the image.,"There are six people in the image, including the two women and four men."
+counting_people1,./eval/eval_data/images/friends/can-count2.jpg,Count the number of people in the image.,"The number of people in the image is not specified, but there are at least six people, including two women and four men."
+counting_people1,What are the differences between the two images? Are they the same group of people? Explain why.,What are the differences between the two images? Are they the same group of people? Explain why.,"The first image has six people, including two women and four men, while the second image also has six people, including two women and four men. However, the people in the two images are not the same as they have different positions and backgrounds. The first image has the people sitting on a couch, while the second image has the people standing in front of a wall. The first image has the people's faces visible, while the second image has some of the faces partially hidden. The reason for the difference is that the two images are taken at different times and locations, with different people posing for the pictures."
+counting_people1,Are you familiar with this TV series? Can you name the characters shown in the provided images? Who are they?,Are you familiar with this TV series? Can you name the characters shown in the provided images? Who are they?,"Yes, I am familiar with the TV series. The characters in the images are from the TV series Friends. The characters are: Monica, Rachel, Phoebe, Joey, Chandler, and Ross."
+counting_people2,./eval/eval_data/images/friends/can-count1.jpg,How many individuals are depicted in the image?,There are four individuals in the image.
+counting_people2,./eval/eval_data/images/friends/can-count2.jpg,How many individuals can you see in the second image?,There are six individuals in the second image.
+counting_people2,Can you spot any differences between these two images? Do they represent the same set of people? Please provide a rationale.,Can you spot any differences between these two images? Do they represent the same set of people? Please provide a rationale.,"The two images represent different sets of people. The first image shows four people sitting on a couch, while the second image shows six people standing in a row."
+counting_people2,Do you recognize this TV show? Can you name the characters shown in the provided images? Who are they?,Do you recognize this TV show? Can you name the characters shown in the provided images? Who are they?,"No, these images do not represent a TV show. The characters in the images are not named."
+counting_people3,./eval/eval_data/images/friends/wrong-count1.jpg,Count the number of people in the image.,There are four people in the image.
+counting_people3,./eval/eval_data/images/friends/wrong-count2.jpg,Count the number of people in the image.,There are six people in the image.
+counting_people3,What are the differences between the two images? Are they the same group of people? Explain why.,What are the differences between the two images? Are they the same group of people? Explain why.,"The first image has four people sitting on a bench, while the second image has six people standing next to each other. The people in the two images are not the same. The reason is that the first image is a group of friends posing for a picture, while the second image is a group of actors posing for a promotional photo."
+counting_people4,./eval/eval_data/images/friends/wrong-count1.jpg,How many individuals are depicted in the image?,There are four individuals depicted in the image.
+counting_people4,./eval/eval_data/images/friends/wrong-count2.jpg,How many individuals are depicted in the image?,There are six individuals in the image.
+counting_people4,Can you spot any differences between these two images? Do they represent the same set of people? Please provide a rationale.,Can you spot any differences between these two images? Do they represent the same set of people? Please provide a rationale.,"Yes, there are several differences between the two images. The first image has four people sitting on a bench, while the second image has six people standing. The people in the first image are wearing different clothes, while the people in the second image are wearing similar clothes. The location of the first image is a park, while the second image is taken in front of a tree."
+counting_people4,Do you recognize this TV show? Can you name the characters shown in the provided images? Who are they?,Do you recognize this TV show? Can you name the characters shown in the provided images? Who are they?,"Yes, I recognize the characters from the TV show Friends."
+zootopia_adventures1,./eval/eval_data/images/zootopia/z1.png,Please describe the image in detail.,"The image is a digital recreation of a scene featuring two stuffed animals, a dog and a cat, sitting at a desk with a phone in front of them. The desk has a sign that says ""Next Window."" There are also two people in the background, one on the left and one on the right."
+zootopia_adventures1,./eval/eval_data/images/zootopia/z2.png,Please describe the image in detail.,"The image is a digital recreation of a scene featuring a raccoon, a fox, and a squirrel standing in front of a counter with a sign that says ""Next Window."" There are also two people in the background, one on the left and one on the right."
+zootopia_adventures1,./eval/eval_data/images/zootopia/z3.png,Can you name the characters in the images? Who are they? What are they doing?,"The characters in the images are a fox and a raccoon. They are sitting at a desk with a phone and a sign that says ""Next Window."" The fox is holding a coffee mug and the raccoon is holding a cup. They are likely discussing something or waiting for someone."
+zootopia_adventures1,"You are an imaginative storyteller. Create a fascinating story based on the first, second and third image.","You are an imaginative storyteller. Create a fascinating story based on the first, second and third image.","The story could be about a fox and a raccoon who are working together in an office. They are sitting at a desk with a phone and a sign that says ""Next Window"". The fox is holding a coffee mug and the raccoon is holding a cup. They are discussing something important while waiting for someone to come to their desk."
+zootopia_adventures1,Are you familiar with these characters? What movie are they from?,Are you familiar with these characters? What movie are they from?,"Yes, I am familiar with these characters. They are from the movie ""Zootopia""."
+zootopia_adventures1,Can you name the characters in the images? Who are they?,Can you name the characters in the images? Who are they?,The characters in the images are a fox and a raccoon.
+zootopia_adventures1,In what type of environment or setting do these characters live? Describe it.,In what type of environment or setting do these characters live? Describe it.,"The characters live in an office setting. They are sitting at a desk with a phone and a sign that says ""Next Window""."
+zootopia_adventures2,./eval/eval_data/images/zootopia/z1.png ./eval/eval_data/images/zootopia/z2.png ./eval/eval_data/images/zootopia/z3.png,Create an engaging story strictly based on the images.,The engaging story could be about a group of stuffed animals who are waiting in line at a bank or a store. They might be discussing their plans or sharing their experiences while waiting. The story could be told from the perspective of one of the stuffed animals or from an outside observer.
+zootopia_adventures2,Do you recognize the setting or the characters in these images? Name the movie.,Do you recognize the setting or the characters in these images? Name the movie.,"The setting is a bank or a store, and the characters are stuffed animals. The movie is not specified."
+zootopia_adventures2,Can you share some interesting facts or details about the characters shown in the images?,Can you share some interesting facts or details about the characters shown in the images?,"The characters shown are stuffed animals, including a teddy bear, a fox, and a hippo. They are not real animals, but rather toys or figurines."
+zootopia_adventures2,Which character do you find the most intriguing and why?,Which character do you find the most intriguing and why?,I find the fox character the most intriguing because it is a stuffed animal that is not commonly seen in this context. It adds a unique element to the scene and makes it stand out.
+zootopia_adventures2,"Based on the images, can you create some dialogues that the characters might say to each other in these situations?","Based on the images, can you create some dialogues that the characters might say to each other in these situations?","Sure, here are some possible dialogues:
+- ""I'm waiting for my turn at the bank.""
+- ""I'm here to purchase something at the store.""
+- ""I hope we don't have to wait too long.""
+- ""I'm excited to see what's in store for us.""
+- ""I'm curious about what's going on around us.""
+- ""I'm glad we're all here together.""
+- ""I'm looking forward to our next adventure."""
+zootopia_adventures3,./eval/eval_data/images/zootopia/z1.png,Examine and describe the characters' actions in the first image.,"The characters in the first image are a group of stuffed animals, including a plush elephant, a plush horse, and a plush dog, standing behind a counter in a mock-up of a bank or store. The elephant is holding a sign that says ""Next Window,"" while the horse and dog are standing behind the counter. The scene is likely meant to depict a playful or imaginative representation of a bank or store setting, with the stuffed animals acting as the employees or customers."
+zootopia_adventures3,./eval/eval_data/images/zootopia/z2.png,"In the second image, what are the main characters doing, and how do they seem to feel?","The main characters in the second image are a fox and a squirrel. They are standing in front of a counter, and the fox is holding a sign that says ""Next Window."" The squirrel is standing next to the fox, and they both seem to be looking at the sign. The fox appears to be smiling, while the squirrel's expression is not clear."
+zootopia_adventures3,Contrast the characters' moods and interactions in the two provided images.,Contrast the characters' moods and interactions in the two provided images.,"In the first image, the stuffed animals are not interacting with each other, so it is difficult to determine their moods. However, the plush elephant holding a sign that says ""Next Window"" seems to be in a neutral mood.
+
+In the second image, the fox and the squirrel are standing in front of a counter, and the fox is holding a sign that says ""Next Window."" The squirrel is standing next to the fox, and they both seem to be looking at the sign. It is difficult to determine their moods, but they appear to be focused on the task at hand."
+zootopia_adventures3,./eval/eval_data/images/zootopia/z3.png,Imagine and narrate a hilarious situation involving the characters from the images.,"Imagine a scenario where the stuffed animals from the first image are having a conversation with the fox and the squirrel from the second image. The stuffed animals could be asking the fox and the squirrel about their experiences in the bank or the store, and the fox and the squirrel could be sharing their funny stories or mishaps while working there. The conversation could be full of humor and silliness, making it a hilarious situation to imagine."
+zootopia_adventures3,"Name the movie from which these characters are, and give a succinct summary of its plot.","Name the movie from which these characters are, and give a succinct summary of its plot.","The movie is ""The Fox and the Hound,"" and the plot revolves around the unlikely friendship between a fox and a hound. The fox, Tod, is raised by a woman who teaches him to be a thief, while the hound, Copper, is trained by a hunter to hunt foxes. When the two meet, they become friends, but their friendship is challenged when the hunter sets out to capture Tod. The hunter's dog, Chief, also joins the hunt. In the end, the hunter catches Tod, but Copper saves him, and the two friends escape together."
+zootopia_adventures3,Create a funny and unexpected scenario that could unfold between the characters in these images.,Create a funny and unexpected scenario that could unfold between the characters in these images.,"In this scenario, the stuffed animals from the first image are having a party, and the fox and the squirrel from the second image crash the party. The stuffed animals are surprised to see the fox and the squirrel, but they quickly welcome them and start to have a good time together. However, the party takes a turn when the stuffed animals realize that the fox and the squirrel are not from their movie, and they start to question their presence. The fox and the squirrel try to explain that they are from a different movie, but the stuffed animals are not convinced. The party turns into a debate about which movie is better, and the fox and the squirrel end up leaving the party feeling disappointed."
+tech_ceos1,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this first image?,"The person in the first image is Steve Jobs, the CEO of Apple."
+tech_ceos1,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this second image?,"This person is Bill Gates, the co-founder of Microsoft."
+tech_ceos1,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this third image?,"This person is Bill Gates, the co-founder of Microsoft."
+tech_ceos1,Recall who is in the second image.,Recall who is in the second image.,"The second image shows a man holding a microphone and standing in front of a large sign that says ""Toshiba."""
+tech_ceos1,Recall who is in the first image.,Recall who is in the first image.,The first image shows Steve Jobs holding an iPhone.
+tech_ceos1,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Yes, the person in the first image is the founder of Apple, Steve Jobs."
+tech_ceos1,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"No, the person in the third image is Bill Gates, the co-founder of Microsoft."
+tech_ceos2,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in the first image?,"The person in the first image is a man named Elon Musk, the founder and CEO of Tesla."
+tech_ceos2,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in the second image?,"The person in the second image is Bill Gates, the co-founder of Microsoft."
+tech_ceos2,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in the third image?,"The person in the third image is Steve Jobs, the co-founder and CEO of Apple."
+tech_ceos2,Recall who is in the second image.,Recall who is in the second image.,The second image shows a man holding up an iPhone.
+tech_ceos2,Recall who is in the first image.,Recall who is in the first image.,The first image shows a man standing in front of a microphone.
+tech_ceos2,"Is the person in the first image the founder of Apple? If not, which of the above images is the person the founder of Apple. ","Is the person in the first image the founder of Apple? If not, which of the above images is the person the founder of Apple. ",The person in the first image is not the founder of Apple. The person in the second image is the founder of Apple.
+tech_ceos2,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"Yes, the person in the third image is the founder of Apple, Steve Jobs."
diff --git a/applications/DeepSpeed-VisualChat/eval/results/eval_comprehensive/ours-set2_best_eval.csv b/applications/DeepSpeed-VisualChat/eval/results/eval_comprehensive/ours-set2_best_eval.csv
new file mode 100644
index 000000000..c129fefb7
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/eval/results/eval_comprehensive/ours-set2_best_eval.csv
@@ -0,0 +1,94 @@
+test_name,image_path,question,answer
+cat_images1,./eval/eval_data/images/cats/1806905748_adb926a0a0.jpg,Please describe the image in detail.,"The image shows a black cat lying on its back on a carpeted floor. The cat is relaxed and appears to be purring, as indicated by the caption. It has its paws stretched out and its eyes closed, enjoying the comfort of the carpet and the warmth of the room. The cat's fur is smooth and shiny, and its body is well-proportioned. The image captures the essence of a contented and relaxed feline moment."
+cat_images1,./eval/eval_data/images/cats/british_shorthair.jpg,Please describe the image in detail.,"The image shows a gray and white cat lying on a carpeted floor, stretching out its paws and relaxing. The cat is lying on its side, with its head and body positioned close to a carpeted staircase. The cat's paws are extended and its eyes are closed, indicating that it is comfortable and at ease in its surroundings. The cat's fur is smooth and shiny, and its body is well-proportioned. The image captures the essence of a contented feline moment."
+cat_images1,./eval/eval_data/images/cats/cat.png,Please describe the image in detail.,"The image shows a black and white cat lying on a carpeted floor, with its paws stretched out and its eyes closed. The cat is relaxed and appears to be enjoying the comfort of the carpet. The cat's fur is smooth and shiny, and its body is well-proportioned. The image captures the essence of a contented feline moment."
+cat_images1,Are the colors of the three cats the same?,Are the colors of the three cats the same?,"No, the colors of the three cats are different. One is black, one is gray and white, and one is brown and white."
+cat_images1,What are the differences between the first and third images?,What are the differences between the first and third images?,The first image shows a black cat lying on a carpeted floor while the third image shows a brown and white cat lying on a carpeted floor. The first image has a larger bounding box than the third image.
+cat_images1,What are the differences between the second and third images?,What are the differences between the second and third images?,The second image shows a black cat lying on a carpeted staircase while the third image shows a brown and white cat lying on a carpeted floor. The second image has a smaller bounding box than the third image.
+cat_images1,Is the cat in the first image in the sunshine?,Is the cat in the first image in the sunshine?,"No, the cat in the first image is not in the sunshine. It is lying on a carpeted floor."
+cat_images2,./eval/eval_data/images/cats/1806905748_adb926a0a0.jpg,Please describe the image in detail.,"The image shows a black cat lying on its back on a carpeted floor. The cat is relaxed and appears to be purring, as indicated by the caption. It has its paws stretched out and its eyes closed, enjoying the comfort of the carpet and the warmth of the room. The cat's fur is smooth and shiny, and its body is well-proportioned. The image captures the essence of a contented and relaxed feline moment."
+cat_images2,./eval/eval_data/images/cats/british_shorthair.jpg,Please describe the image in detail.,"The image shows a gray and white cat lying on a carpeted floor, stretching out its paws and relaxing. The cat is lying on its side, with its head and body positioned close to a carpeted staircase. The cat's paws are extended and its eyes are closed, indicating that it is comfortable and at ease in its surroundings. The cat's fur is smooth and shiny, and its body is well-proportioned. The image captures the essence of a contented feline moment."
+cat_images2,What are the differences between the two images?,What are the differences between the two images?,The first image shows a black cat lying on a carpet while the second image shows a gray and white cat lying on a carpet near a staircase.
+cat_images2,./eval/eval_data/images/cats/cat.png,Please describe the image in detail.,"The image shows a black and white cat lying on a carpeted floor, with its paws stretched out and its eyes closed. The cat is relaxed and appears to be enjoying the comfort of the carpet. The cat's fur is smooth and shiny, and its body is well-proportioned. The image captures the essence of a contented feline moment."
+cat_images2,Are the colors of the three cats the same?,Are the colors of the three cats the same?,"No, the colors of the three cats are different. The first cat is black, the second cat is gray and white, and the third cat is black and white."
+cat_images2,What are the differences between the first and third images?,What are the differences between the first and third images?,The first image shows a black cat lying on a carpet while the third image shows a black and white cat lying on a carpet near a staircase.
+cat_images2,What are the differences between the second and third images?,What are the differences between the second and third images?,The second image shows a black cat lying on a carpet while the third image shows a gray and white cat lying on a carpet near a staircase.
+cat_images2,Is the cat in the first image in the sunshine?,Is the cat in the first image in the sunshine?,"No, the cat in the first image is not in the sunshine. It is lying on a carpet indoors."
+cat_images2,Which cat do you prefer and why?,Which cat do you prefer and why?,"I personally prefer the gray and white cat in the third image. It appears to be relaxed and content, with its eyes closed and paws stretched out, which gives off a sense of tranquility and comfort. The cat's fur is also well-groomed and shiny, which adds to its overall appearance."
+cat_images2,I prefer the second cat. It's so cute.,I prefer the second cat. It's so cute.,"I agree, the second cat is adorable. It looks so comfortable and relaxed, with its paws stretched out and eyes closed. It's a great image of a cat enjoying a moment of peace and contentment."
+cat_images2,Then why do you prefer the third cat more?,Then why do you prefer the third cat more?,"I prefer the third cat more because it looks more relaxed and content. It is lying on the carpet with its eyes closed and paws stretched out, which gives off a sense of tranquility and comfort. The cat's fur is also well-groomed and shiny, which adds to its overall appearance."
+counting_people1,./eval/eval_data/images/friends/can-count1.jpg,Count the number of people in the image.,There are six people in the image.
+counting_people1,./eval/eval_data/images/friends/can-count2.jpg,Count the number of people in the image.,There are six people in the image.
+counting_people1,What are the differences between the two images? Are they the same group of people? Explain why.,What are the differences between the two images? Are they the same group of people? Explain why.,"The two images show different groups of people. In the first image, there are six people posing for a photo, while in the second image, there are six people standing together, but they are not posing for a photo. The first image shows the same group of people in different poses, while the second image shows a different group of people standing together."
+counting_people1,Are you familiar with this TV series? Can you name the characters shown in the provided images? Who are they?,Are you familiar with this TV series? Can you name the characters shown in the provided images? Who are they?,"Yes, I am familiar with the TV series ""The Simpsons."" The characters shown in the provided images are Homer Simpson, Marge Simpson, Bart Simpson, Lisa Simpson, and Maggie Simpson."
+counting_people2,./eval/eval_data/images/friends/can-count1.jpg,How many individuals are depicted in the image?,There are six individuals depicted in the image.
+counting_people2,./eval/eval_data/images/friends/can-count2.jpg,How many individuals can you see in the second image?,There are six individuals in the second image.
+counting_people2,Can you spot any differences between these two images? Do they represent the same set of people? Please provide a rationale.,Can you spot any differences between these two images? Do they represent the same set of people? Please provide a rationale.,"The first image shows a group of people standing in front of a building, while the second image shows a group of people sitting on a couch. The first image does not provide any information about the people's identities, while the second image shows six individuals with different appearances and clothing. The rationale is that the first image is a general group photo, while the second image is a more specific setting with individual portraits."
+counting_people2,Do you recognize this TV show? Can you name the characters shown in the provided images? Who are they?,Do you recognize this TV show? Can you name the characters shown in the provided images? Who are they?,"Yes, the images depict characters from the popular TV show ""Friends."" The characters are (from left to right) Monica Geller, Rachel Green, Phoebe Buffay, Joey Tribbiani, Chandler Bing, and Ross Geller."
+counting_people3,./eval/eval_data/images/friends/wrong-count1.jpg,Count the number of people in the image.,There are five people in the image.
+counting_people3,./eval/eval_data/images/friends/wrong-count2.jpg,Count the number of people in the image.,There are five people in the image.
+counting_people3,What are the differences between the two images? Are they the same group of people? Explain why.,What are the differences between the two images? Are they the same group of people? Explain why.,"The two images show different groups of people. In the first image, there are five people sitting on a bench, while in the second image, there are five people standing in front of a building. The people in the first image are wearing casual clothes, while the people in the second image are wearing formal attire. The first image shows a group of friends sitting together, while the second image shows a group of people posing for a photo in front of a building. The first image does not show any buildings or background, while the second image has a building in the background. The first image does not have any formal attire, while the second image has formal attire. The first image does not have any posing, while the second image has people posing for a photo. The first image does not have any buildings or background, while the second image has a building in the background. The first image does not have any formal attire, while the second image has formal attire. The first image does not have any posing, while the second image has people posing for a photo. The first image does not have any buildings or background, while the second image has a building in the background. The first image does not have any"
+counting_people4,./eval/eval_data/images/friends/wrong-count1.jpg,How many individuals are depicted in the image?,There are four individuals depicted in the image.
+counting_people4,./eval/eval_data/images/friends/wrong-count2.jpg,How many individuals are depicted in the image?,There are four individuals depicted in the image.
+counting_people4,Can you spot any differences between these two images? Do they represent the same set of people? Please provide a rationale.,Can you spot any differences between these two images? Do they represent the same set of people? Please provide a rationale.,"The two images depict different sets of people. In the first image, there are four people sitting on a bench, while in the second image, there are four people standing in front of a building. The first image shows a group of friends posing for a picture, while the second image appears to be a group of people posing for a promotional photo or a public event."
+counting_people4,Do you recognize this TV show? Can you name the characters shown in the provided images? Who are they?,Do you recognize this TV show? Can you name the characters shown in the provided images? Who are they?,"I'm sorry, I'm not familiar with the TV show you're referring to. The characters shown in the provided images are not identified, and I cannot provide any information about them."
+zootopia_adventures1,./eval/eval_data/images/zootopia/z1.png,Please describe the image in detail.,"The image shows a group of stuffed animals sitting at a desk in a room. There are three stuffed animals, including a teddy bear, a dog, and a cat. The teddy bear is sitting in the middle of the desk, while the dog and cat are sitting on either side of the teddy bear. The room is decorated with a rug on the floor and a picture hanging on the wall."
+zootopia_adventures1,./eval/eval_data/images/zootopia/z2.png,Please describe the image in detail.,"The image shows a group of stuffed animals sitting at a desk in a room. The stuffed animals are wearing clothes and are sitting on chairs. There are three stuffed animals in the scene, including a teddy bear, a dog, and a cat. The room is decorated with a rug on the floor and a picture hanging on the wall."
+zootopia_adventures1,./eval/eval_data/images/zootopia/z3.png,Can you name the characters in the images? Who are they? What are they doing?,"The characters in the images are a stuffed animal, a dog, and a cat. They are sitting at a desk in a room."
+zootopia_adventures1,"You are an imaginative storyteller. Create a fascinating story based on the first, second and third image.","You are an imaginative storyteller. Create a fascinating story based on the first, second and third image.","Once upon a time, in a cozy room filled with stuffed animals, a teddy bear, a dog, and a cat sat at a desk. They were having a meeting to discuss their plans for the day. The teddy bear, being the leader of the group, suggested they go on an adventure to explore the nearby forest. The dog, being the most adventurous of the group, was excited about the idea and immediately agreed. The cat, on the other hand, was a bit hesitant but eventually agreed to join them.
+The group packed their things and set out on their journey. As they walked through the forest, they encountered various animals and plants. The dog was always eager to explore and sniff around, while the teddy bear and cat kept a close eye on their surroundings. They came across a stream, where they decided to take a break and have a picnic.
+While they were enjoying their meal, they heard a strange noise coming from the other side of the stream. The dog, being curious, jumped into the water to investigate. The teddy bear and cat were worried about him, but he quickly returned, wagging his tail and barking excited"
+zootopia_adventures1,Are you familiar with these characters? What movie are they from?,Are you familiar with these characters? What movie are they from?,"The characters in the images are from the movie ""Zootopia."""
+zootopia_adventures1,Can you name the characters in the images? Who are they?,Can you name the characters in the images? Who are they?,"The characters in the images are a stuffed animal, a dog, and a cat."
+zootopia_adventures1,In what type of environment or setting do these characters live? Describe it.,In what type of environment or setting do these characters live? Describe it.,The characters live in a room filled with stuffed animals.
+zootopia_adventures2,./eval/eval_data/images/zootopia/z1.png ./eval/eval_data/images/zootopia/z2.png ./eval/eval_data/images/zootopia/z3.png,Create an engaging story strictly based on the images.,"The teddy bear and the stuffed fox were sitting in a waiting room, patiently waiting for their turn to be seen by the doctor. They were surrounded by other stuffed animals, all waiting their turn for a check-up.
+The teddy bear was feeling a bit under the weather, with a slight cough and a runny nose. The fox, being a good friend, was trying to comfort the teddy bear and make him feel better.
+As they waited, they noticed a group of other stuffed animals gathered around a television screen. The news was on, and the stuffed animals were watching with great interest. The teddy bear and the fox joined the group, curious to know what was happening.
+Suddenly, the news anchor announced that a new children's hospital was opening in the city, and they were showing footage of the construction site. The teddy bear and the fox were excited to hear about the new hospital, as it would provide better facilities and care for all the stuffed animals in need.
+As the news segment ended, the teddy bear and the fox returned to their seats, waiting patiently for their turn to be seen by the doctor. They knew that with"
+zootopia_adventures2,Do you recognize the setting or the characters in these images? Name the movie.,Do you recognize the setting or the characters in these images? Name the movie.,"The setting and characters in these images resemble the movie ""Toy Story."""
+zootopia_adventures2,Can you share some interesting facts or details about the characters shown in the images?,Can you share some interesting facts or details about the characters shown in the images?,"Sure, here are some interesting facts about the characters in the movie ""Toy Story"":
+1. Woody, the cowboy doll, is the leader of the toys and the protagonist of the movie. He is voiced by Tom Hanks.
+2. Buzz Lightyear, the space ranger doll, is Woody's rival and best friend. He is voiced by Tim Allen.
+3. Mr. Potato Head is a versatile toy who can be assembled in various ways, with different parts that can be attached to his body. He is voiced by Don Rickles.
+4. Slinky Dog is a happy-go-lucky dog toy who is always eager to help Woody and the other toys. He is voiced by Jim Varney.
+5. Rex, the green dinosaur toy, is a nervous and excitable character who often gets scared easily. He is voiced by Wallace Shawn.
+6. Hamm, the piggy bank toy, is a wise-cracking character who often provides comic relief. He is voiced by John Ratzenberger.
+7. Bo Peep, the por"
+zootopia_adventures2,Which character do you find the most intriguing and why?,Which character do you find the most intriguing and why?,"I find the character of Mr. Potato Head to be the most intriguing in the movie ""Toy Story."" He is a versatile toy who can be assembled in various ways, with different parts that can be attached to his body. He is also a wise and humorous character, often providing insight and humor to the story. His ability to adapt and change his appearance makes him unique among the other toys, and his relationship with Woody is one of the central themes of the movie."
+zootopia_adventures2,"Based on the images, can you create some dialogues that the characters might say to each other in these situations?","Based on the images, can you create some dialogues that the characters might say to each other in these situations?","Sure, here are some possible dialogues between the characters in the images:
+1. Woody and Buzz Lightyear:
+Woody: ""Hey, Buzz, what's up?""
+Buzz Lightyear: ""Not much, Woody. Just waiting for our turn to be seen by the doctor.""
+Woody: ""Yeah, I'm feeling a bit under the weather myself. Maybe we should take it easy today.""
+Buzz Lightyear: ""Sure thing, Woody. Let's just relax and try to stay calm.""
+2. Woody and Mr. Potato Head:
+Woody: ""Hey, Mr. Potato Head, how's it going?""
+Mr. Potato Head: ""Not bad, Woody. Just waiting for our turn to be seen by the doctor.""
+Woody: ""Yeah, I'm feeling a bit under the weather myself. Do you think we should ask for a different doctor?""
+Mr. Potato Head: ""I don't know, Woody. Maybe we should just give this one a chance. After all, they're professionals.""
+Woody: """
+zootopia_adventures3,./eval/eval_data/images/zootopia/z1.png,Examine and describe the characters' actions in the first image.,"In the first image, two characters are sitting on a couch, one of them holding a cup of coffee. The other character is holding a newspaper, and they appear to be relaxing and enjoying their time together."
+zootopia_adventures3,./eval/eval_data/images/zootopia/z2.png,"In the second image, what are the main characters doing, and how do they seem to feel?","In the second image, the main characters are standing in a room with a receptionist. They appear to be talking to the receptionist, and they seem to be in a friendly and casual mood."
+zootopia_adventures3,Contrast the characters' moods and interactions in the two provided images.,Contrast the characters' moods and interactions in the two provided images.,"In the first image, the two characters are relaxed and enjoying their time together, sharing a cup of coffee and a newspaper. They appear to be in a comfortable and peaceful setting.
+In contrast, the second image shows the main characters interacting with a receptionist in a more formal setting. They seem to be engaging in a conversation with the receptionist, possibly inquiring about something or seeking assistance. The mood is more casual and friendly, but there is a sense of formality due to the presence of the receptionist and the setting."
+zootopia_adventures3,./eval/eval_data/images/zootopia/z3.png,Imagine and narrate a hilarious situation involving the characters from the images.,"In the first image, the two characters are enjoying a cup of coffee and a newspaper while relaxing on a couch. Suddenly, a wild and crazy character from the second image bursts into the scene, causing a commotion. The main characters are startled and try to figure out what's going on. The wild character, who is a stuffed animal, is dressed in a suit and tie and is holding a newspaper. He is trying to get their attention, but they are too surprised to respond. The wild character starts to dance and sing, making the situation even more chaotic. The main characters eventually join in the fun, and the scene turns into a hilarious dance party. The wild character's antics bring a new level of excitement and humor to the scene, making it a memorable and entertaining moment."
+zootopia_adventures3,"Name the movie from which these characters are, and give a succinct summary of its plot.","Name the movie from which these characters are, and give a succinct summary of its plot.","The characters are from the movie ""Zootopia,"" which tells the story of a city where animals of all species coexist and work together. The main characters are a rabbit named Judy Hopps, who becomes a police officer, and a fox named Nick Wilde, who is a con artist. Together, they work to solve a mystery involving the disappearance of several animals and uncover a conspiracy that threatens the harmony of Zootopia. The movie explores themes of prejudice, trust, and the importance of working together to overcome challenges."
+zootopia_adventures3,Create a funny and unexpected scenario that could unfold between the characters in these images.,Create a funny and unexpected scenario that could unfold between the characters in these images.,"In a hilarious and unexpected twist, the main characters from the first image, the two relaxed characters sitting on a couch, suddenly find themselves transported into the world of the second image, where they encounter the wild and crazy stuffed animal in a suit. The two characters are taken aback by this unexpected turn of events and try to make sense of their new surroundings.
+The stuffed animal, who is dressed in a suit and holding a newspaper, starts to dance and sing, causing the two main characters to join in the fun. However, as they dance and sing, they realize that they are not in their usual environment and that they need to find a way back to their original world.
+The two main characters and the wild stuffed animal embark on a journey to find a way back to their respective worlds, encountering various obstacles and challenges along the way. They eventually discover a portal that allows them to return to their original world, where they are relieved to be back in their familiar surroundings.
+The unexpected and hilarious adventure that the two main characters and the wild stuffed animal experience serves as a reminder that sometimes the most unexpected things can happen, and that"
+tech_ceos1,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this first image?,"This person is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos1,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this second image?,"This person is Bill Gates, the co-founder of Microsoft."
+tech_ceos1,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this third image?,"This person is Elon Musk, the CEO of Tesla and SpaceX."
+tech_ceos1,Recall who is in the second image.,Recall who is in the second image.,"The person in the second image is Bill Gates, the co-founder of Microsoft."
+tech_ceos1,Recall who is in the first image.,Recall who is in the first image.,"The person in the first image is Steve Jobs, the co-founder of Apple."
+tech_ceos1,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Yes, the person in the first image is Steve Jobs, the founder of Apple."
+tech_ceos1,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"No, the person in the third image is Elon Musk, the founder of Tesla and SpaceX."
+tech_ceos2,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in the first image?,"The person in the first image is Elon Musk, the CEO of Tesla."
+tech_ceos2,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in the second image?,"The person in the second image is Bill Gates, the founder of Microsoft."
+tech_ceos2,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in the third image?,"The person in the third image is Elon Musk, the CEO of Tesla."
+tech_ceos2,Recall who is in the second image.,Recall who is in the second image.,"The person in the second image is Bill Gates, the founder of Microsoft."
+tech_ceos2,Recall who is in the first image.,Recall who is in the first image.,"The person in the first image is Elon Musk, the CEO of Tesla."
+tech_ceos2,"Is the person in the first image the founder of Apple? If not, which of the above images is the person the founder of Apple. ","Is the person in the first image the founder of Apple? If not, which of the above images is the person the founder of Apple. ",The person in the first image is not the founder of Apple. The person in the second image is the founder of Microsoft.
+tech_ceos2,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"No, the person in the third image is not the founder of Apple."
diff --git a/applications/DeepSpeed-VisualChat/eval/results/eval_comprehensive/ours-set2_final.csv b/applications/DeepSpeed-VisualChat/eval/results/eval_comprehensive/ours-set2_final.csv
new file mode 100644
index 000000000..dc800fc89
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/eval/results/eval_comprehensive/ours-set2_final.csv
@@ -0,0 +1,103 @@
+test_name,image_path,question,answer
+cat_images1,./eval/eval_data/images/cats/1806905748_adb926a0a0.jpg,Please describe the image in detail.,"The image shows a black cat lying on its back on a carpeted floor. The cat is relaxed and appears to be purring, as indicated by the caption. It has its paws stretched out and its eyes closed, enjoying the comfort of the carpet and possibly receiving some love or attention from its owner. The cat's fur is smooth and shiny, and its body is well-proportioned. The image conveys a sense of peace and contentment, highlighting the bond between humans and their pets."
+cat_images1,./eval/eval_data/images/cats/british_shorthair.jpg,Please describe the image in detail.,"The image shows a gray and white cat lying on a carpeted floor, stretching out its paws and relaxing. The cat is lying on its side, with its eyes closed, and appears to be enjoying a comfortable and peaceful moment. The cat's fur is well-groomed, and it is surrounded by a clean and tidy environment. The image conveys a sense of tranquility and the bond between humans and their pets."
+cat_images1,./eval/eval_data/images/cats/cat.png,Please describe the image in detail.,"The image shows a black and white cat lying on a carpeted floor, surrounded by greenery. The cat is lying on its back, with its paws stretched out and its eyes closed, enjoying a moment of relaxation. The greenery in the background adds a touch of nature to the scene, creating a peaceful and serene atmosphere. The cat's fur is well-groomed, and it appears to be in a comfortable and contented state."
+cat_images1,Are the colors of the three cats the same?,Are the colors of the three cats the same?,"No, the colors of the three cats are different. One is black and white, one is gray and white, and one is brown and white."
+cat_images1,What are the differences between the first and third images?,What are the differences between the first and third images?,The first image shows a black cat lying on a carpeted floor while the third image shows a brown and white cat lying on a carpeted floor. The first image has a larger bounding box than the third image.
+cat_images1,What are the differences between the second and third images?,What are the differences between the second and third images?,The second image shows a black cat lying on its back while the third image shows a brown and white cat lying on its back. The second image has a smaller bounding box than the third image.
+cat_images1,Is the cat in the first image in the sunshine?,Is the cat in the first image in the sunshine?,"No, the cat in the first image is not in the sunshine. It is lying on a carpeted floor."
+cat_images2,./eval/eval_data/images/cats/1806905748_adb926a0a0.jpg,Please describe the image in detail.,"The image shows a black cat lying on its back on a carpeted floor. The cat is relaxed and appears to be purring, as indicated by the caption. It has its paws stretched out and its eyes closed, enjoying the comfort of the carpet and possibly receiving some love or attention from its owner. The cat's fur is smooth and shiny, and its body is well-proportioned. The image conveys a sense of peace and contentment, highlighting the bond between humans and their pets."
+cat_images2,./eval/eval_data/images/cats/british_shorthair.jpg,Please describe the image in detail.,"The image shows a gray and white cat lying on a carpeted floor, stretching out its paws and relaxing. The cat is lying on its side, with its eyes closed, and appears to be enjoying a comfortable and peaceful moment. The cat's fur is well-groomed, and it is surrounded by a clean and tidy environment. The image conveys a sense of tranquility and the bond between humans and their pets."
+cat_images2,What are the differences between the two images?,What are the differences between the two images?,The first image shows a black cat lying on a carpet while the second image shows a gray and white cat lying on a carpet near a couch.
+cat_images2,./eval/eval_data/images/cats/cat.png,Please describe the image in detail.,"The image shows a black and white cat lying on a carpeted floor, stretching out its paws and looking up at something. The cat appears to be relaxed and enjoying its surroundings. The carpet is a light color, and the cat's fur is well-groomed. The image conveys a sense of tranquility and the bond between humans and their pets."
+cat_images2,Are the colors of the three cats the same?,Are the colors of the three cats the same?,"No, the colors of the three cats are different. The first cat is black, the second cat is gray and white, and the third cat is brown and white."
+cat_images2,What are the differences between the first and third images?,What are the differences between the first and third images?,The first image shows a black cat lying on a carpet while the third image shows a brown and white cat lying on a carpet near a couch.
+cat_images2,What are the differences between the second and third images?,What are the differences between the second and third images?,The second image shows a black cat lying on a carpet while the third image shows a gray and white cat lying on a carpet near a couch.
+cat_images2,Is the cat in the first image in the sunshine?,Is the cat in the first image in the sunshine?,"No, the cat in the first image is not in the sunshine. It is lying on a carpet indoors."
+cat_images2,Which cat do you prefer and why?,Which cat do you prefer and why?,"I personally prefer the gray and white cat in the second image. It appears to be relaxed and content, which makes it a pleasant sight to behold. Additionally, its fur is well-groomed, which adds to its overall appeal."
+cat_images2,I prefer the second cat. It's so cute.,I prefer the second cat. It's so cute.,"I agree, the second cat is very cute. It looks relaxed and content, which makes it a pleasant sight to behold. Its fur is well-groomed, which adds to its overall appeal."
+cat_images2,Then why do you prefer the third cat more?,Then why do you prefer the third cat more?,"I prefer the third cat more because it is lying on a carpet near a couch, which gives it a cozy and comfortable environment. Additionally, the cat's fur looks well-groomed, which adds to its overall appeal."
+counting_people1,./eval/eval_data/images/friends/can-count1.jpg,Count the number of people in the image.,There are six people in the image.
+counting_people1,./eval/eval_data/images/friends/can-count2.jpg,Count the number of people in the image.,There are six people in the image.
+counting_people1,What are the differences between the two images? Are they the same group of people? Explain why.,What are the differences between the two images? Are they the same group of people? Explain why.,"The two images show different groups of people. In the first image, there are six people posing for a photo, while in the second image, there are six people standing together, but they are not posing for a photo. The first image shows the group standing in a line, while the second image shows them standing in a group. The first image has a black background, while the second image has a white background. The first image has a caption that says ""Friends,"" while the second image does not have any caption. The first image has a larger size than the second image."
+counting_people1,Are you familiar with this TV series? Can you name the characters shown in the provided images? Who are they?,Are you familiar with this TV series? Can you name the characters shown in the provided images? Who are they?,"Yes, I am familiar with the TV series ""Friends."" The characters shown in the images are:
+1. Ross Geller, who is a paleontologist and a friend of the group.
+2. Rachel Green, who is a fashionista and a friend of the group.
+3. Monica Geller, who is a chef and a friend of the group.
+4. Chandler Bing, who is a sarcastic and witty friend of the group.
+5. Joey Tribbiani, who is a struggling actor and a friend of the group.
+6. Phoebe Buffay, who is a free-spirited and eccentric friend of the group."
+counting_people2,./eval/eval_data/images/friends/can-count1.jpg,How many individuals are depicted in the image?,There are six individuals depicted in the image.
+counting_people2,./eval/eval_data/images/friends/can-count2.jpg,How many individuals can you see in the second image?,There are six individuals in the second image.
+counting_people2,Can you spot any differences between these two images? Do they represent the same set of people? Please provide a rationale.,Can you spot any differences between these two images? Do they represent the same set of people? Please provide a rationale.,"The first image shows a group of people standing together with the word ""Friends"" written on the bottom, while the second image shows a group of six individuals posing for a picture. It is not clear if the two images represent the same set of people or not."
+counting_people2,Do you recognize this TV show? Can you name the characters shown in the provided images? Who are they?,Do you recognize this TV show? Can you name the characters shown in the provided images? Who are they?,"Unfortunately, I am not familiar with the TV show depicted in the provided images. I cannot identify the characters or provide any additional information about the show."
+counting_people3,./eval/eval_data/images/friends/wrong-count1.jpg,Count the number of people in the image.,There are four people in the image.
+counting_people3,./eval/eval_data/images/friends/wrong-count2.jpg,Count the number of people in the image.,There are four people in the image.
+counting_people3,What are the differences between the two images? Are they the same group of people? Explain why.,What are the differences between the two images? Are they the same group of people? Explain why.,"The two images appear to feature different groups of people. In the first image, there are four people sitting on a bench, while in the second image, there are four people standing in front of a building. The people in the first image are wearing casual clothes, while the people in the second image are wearing more formal attire. Additionally, the first image shows a fountain in the background, while the second image does not. These differences suggest that the two images depict different groups of people in different settings."
+counting_people4,./eval/eval_data/images/friends/wrong-count1.jpg,How many individuals are depicted in the image?,There are four individuals depicted in the image.
+counting_people4,./eval/eval_data/images/friends/wrong-count2.jpg,How many individuals are depicted in the image?,There are four individuals depicted in the image.
+counting_people4,Can you spot any differences between these two images? Do they represent the same set of people? Please provide a rationale.,Can you spot any differences between these two images? Do they represent the same set of people? Please provide a rationale.,"The two images depict different sets of people. In the first image, there are four people sitting on a bench, while in the second image, there are four people standing in front of a building. The first image shows a group of friends posing for a picture, while the second image appears to be a group of people posing for a promotional photo or a public event. The rationale behind these two images is that they represent different scenarios and groups of people."
+counting_people4,Do you recognize this TV show? Can you name the characters shown in the provided images? Who are they?,Do you recognize this TV show? Can you name the characters shown in the provided images? Who are they?,"Yes, I recognize this TV show as ""Friends."" The characters shown in the provided images are:
+
+1. Ross Geller, played by David Schwimmer
+2. Monica Geller, played by Courteney Cox
+3. Phoebe Buffay, played by Lisa Kudrow
+4. Joey Tribbiani, played by Matt LeBlanc
+5. Chandler Bing, played by Matthew Perry"
+zootopia_adventures1,./eval/eval_data/images/zootopia/z1.png,Please describe the image in detail.,"The image shows a group of stuffed animals sitting at a desk in a room. The desk is located in the middle of the room, and there are two chairs in front of it. The stuffed animals are sitting at the desk, with one of them holding a cup of coffee. There are also several books and a computer monitor on the desk. The room is well-lit, and there are no people present in the scene."
+zootopia_adventures1,./eval/eval_data/images/zootopia/z2.png,Please describe the image in detail.,"The image shows a group of stuffed animals sitting at a desk in a room. The stuffed animals are wearing clothes and are sitting in front of a computer monitor. There are several books on the desk, and the room is well-lit. The stuffed animals appear to be having a conversation or working together at the desk."
+zootopia_adventures1,./eval/eval_data/images/zootopia/z3.png,Can you name the characters in the images? Who are they? What are they doing?,The characters in the images are a stuffed animal and a raccoon. They are sitting at a desk and using a computer.
+zootopia_adventures1,"You are an imaginative storyteller. Create a fascinating story based on the first, second and third image.","You are an imaginative storyteller. Create a fascinating story based on the first, second and third image.","Once upon a time, in a magical land, there lived a stuffed animal and a raccoon. They were best friends and always found themselves in adventures. One day, they decided to start a business together. They set up a desk in a cozy corner of their home and started working on their project.
+The stuffed animal was a creative genius and came up with innovative ideas. The raccoon was a skilled computer programmer and helped bring those ideas to life. Together, they worked tirelessly to create a revolutionary product that would change the world.
+As they worked, they faced many challenges and obstacles. But with their determination and teamwork, they overcame them all. Their product was a huge success, and they became famous for their innovation and creativity.
+Their story became an inspiration to many, and their friendship was a testament to the power of teamwork and perseverance. They proved that with hard work and dedication, anything is possible, even in the most unlikely of circumstances."
+zootopia_adventures1,Are you familiar with these characters? What movie are they from?,Are you familiar with these characters? What movie are they from?,"The characters in the images are from the movie ""Zootopia."""
+zootopia_adventures1,Can you name the characters in the images? Who are they?,Can you name the characters in the images? Who are they?,"The characters in the images are Judy Hopps and Nick Wilde from the movie ""Zootopia."""
+zootopia_adventures1,In what type of environment or setting do these characters live? Describe it.,In what type of environment or setting do these characters live? Describe it.,"Judy Hopps and Nick Wilde live in the fictional city of Zootopia, which is a melting pot of different animal species. The city is a modern metropolis with various districts, including the police department where Judy works and the street where Nick runs his business. The city is known for its diverse population and its unique blend of animal cultures."
+zootopia_adventures2,./eval/eval_data/images/zootopia/z1.png ./eval/eval_data/images/zootopia/z2.png ./eval/eval_data/images/zootopia/z3.png,Create an engaging story strictly based on the images.,"The teddy bear and the stuffed fox were sitting in a waiting room, patiently waiting for their turn to be seen by the doctor. They were surrounded by other stuffed animals, all in various states of discomfort. The teddy bear had a bandage on his arm, while the fox seemed to be nursing a sore paw.
+The teddy bear turned to the fox and said, ""I hope the doctor can help us. I'm not sure what's wrong with my arm, but it's been hurting for a while."" The fox nodded in agreement and added, ""My paw is swollen and it's been hard to walk. I'm hoping the doctor can give us some relief.""
+As they waited, they observed the other stuffed animals in the room. There was a penguin with a runny nose, a giraffe with a sore throat, and a lion with a cough. The teddy bear and the fox exchanged worried glances, realizing that they were not alone in their misery.
+Suddenly, the door to the doctor's office opened, and a nurse"
+zootopia_adventures2,Do you recognize the setting or the characters in these images? Name the movie.,Do you recognize the setting or the characters in these images? Name the movie.,"The setting is a waiting room, and the characters are stuffed animals. The movie is ""Toy Story 2."""
+zootopia_adventures2,Can you share some interesting facts or details about the characters shown in the images?,Can you share some interesting facts or details about the characters shown in the images?,"Certainly! Here are some interesting facts about the characters in the images:
+1. The teddy bear in the image is a character from the movie ""Toy Story 2."" He is a stuffed bear named Lotso, who was once the beloved toy of a little girl named Andy. However, he was abandoned at a daycare center and became the leader of the toys there.
+2. The stuffed fox in the image is also a character from ""Toy Story 2."" He is a member of Lotso's gang and is known for his loyalty to Lotso.
+3. The teddy bear in the image has a bandage on his arm, indicating that he may have been injured or undergone some kind of medical treatment.
+4. The stuffed fox in the image is nursing a sore paw, suggesting that he may have sustained an injury as well.
+5. The teddy bear and the stuffed fox are both voiced by Ned Beatty in the movie ""Toy Story 2.""
+6. The teddy bear and the stuffed fox are among the many toys that come to life in the ""Toy Story"
+zootopia_adventures2,Which character do you find the most intriguing and why?,Which character do you find the most intriguing and why?,"I find the teddy bear from the movie ""Toy Story 2"" to be the most intriguing character. Despite being a toy, he has a rich backstory and a complex personality. He was once the beloved toy of a little girl named Andy, but was abandoned at a daycare center and became the leader of the toys there. His journey to find his way back to Andy is both heartwarming and adventurous. Additionally, his interactions with other characters, such as the stuffed fox and the nurse, showcase his resilience and determination. Overall, his story and character development make him a compelling and memorable character in the ""Toy Story"" franchise."
+zootopia_adventures2,"Based on the images, can you create some dialogues that the characters might say to each other in these situations?","Based on the images, can you create some dialogues that the characters might say to each other in these situations?","Sure, here are some possible dialogues between the characters in the images:
+1. Teddy Bear: ""I'm not sure what's wrong with my arm, but it's been hurting for a while.""
+Stuffed Fox: ""I'm nursing my sore paw. I think I might have stepped on something sharp.""
+Teddy Bear: ""Oh no, that sounds painful. I hope we can get some help from the doctor soon.""
+2. Teddy Bear: ""I'm worried about my arm. I don't want to miss my turn to see the doctor.""
+Stuffed Fox: ""Don't worry, we'll get our turn soon. Just try to stay calm and patient.""
+Teddy Bear: ""I know, but it's hard when we're both in pain.""
+Stuffed Fox: ""I understand. Let's just try to distract ourselves until it's our turn.""
+3. Teddy Bear: ""I'm so glad we're finally getting to see the doctor. I hope he can help us both feel better.""
+Stuffed Fox: ""Me too. I"
+zootopia_adventures3,./eval/eval_data/images/zootopia/z1.png,Examine and describe the characters' actions in the first image.,"In the first image, two cartoon characters are sitting on a bench, one of them holding a cup of coffee. They appear to be having a conversation, possibly sharing the coffee. The scene is set in a park, with trees and grass in the background."
+zootopia_adventures3,./eval/eval_data/images/zootopia/z2.png,"In the second image, what are the main characters doing, and how do they seem to feel?","The main characters in the second image are a fox and a squirrel, and they appear to be walking together. They seem to be in a good mood and are enjoying their time together."
+zootopia_adventures3,Contrast the characters' moods and interactions in the two provided images.,Contrast the characters' moods and interactions in the two provided images.,"In the first image, the two cartoon characters are sitting on a bench, one of them holding a cup of coffee. They appear to be having a conversation, possibly sharing the coffee. Their mood seems relaxed and friendly.
+In contrast, the second image features a fox and a squirrel walking together. They seem to be in a good mood and are enjoying their time together. Their interaction is more casual and playful, as they are walking side by side. The mood of the scene is light and cheerful."
+zootopia_adventures3,./eval/eval_data/images/zootopia/z3.png,Imagine and narrate a hilarious situation involving the characters from the images.,"In the second image, the fox and the squirrel are walking together, and the squirrel is carrying a cup of coffee. Suddenly, the fox gets a crazy idea and decides to play a prank on the squirrel. He grabs the cup of coffee from the squirrel's hands and drinks it all in one gulp. The squirrel is shocked and outraged, but the fox just laughs and runs away, leaving the squirrel to clean up the mess. The squirrel is furious and chases after the fox, but the fox is too fast and manages to escape. The squirrel is left with a bitter taste in her mouth and a cup of coffee spilled on the ground."
+zootopia_adventures3,"Name the movie from which these characters are, and give a succinct summary of its plot.","Name the movie from which these characters are, and give a succinct summary of its plot.","The characters are from the movie ""Zootopia,"" which is about a city where animals of all species coexist and work together. The plot revolves around a rabbit police officer and a con artist fox who team up to solve a mystery and uncover a conspiracy that threatens the harmony of the city. The movie explores themes of prejudice, discrimination, and the importance of working together despite differences."
+zootopia_adventures3,Create a funny and unexpected scenario that could unfold between the characters in these images.,Create a funny and unexpected scenario that could unfold between the characters in these images.,"In a hilarious twist, the fox and the squirrel from the second image suddenly find themselves transported to the first image, where they are sitting on the bench with the two cartoon characters. The fox and the squirrel are confused and bewildered by their new surroundings, while the two cartoon characters are equally surprised to see them.
+The fox and the squirrel try to make sense of the situation, but they are unable to comprehend the cartoon world they have entered. They attempt to communicate with the cartoon characters, but their attempts are met with confusion and amusement.
+The cartoon characters, intrigued by the fox and the squirrel, offer them coffee, but the fox and the squirrel are not familiar with the concept of coffee and refuse. The cartoon characters then offer them a cup of tea, which the fox and the squirrel happily accept.
+As they enjoy their tea, the fox and the squirrel begin to explore the cartoon world, discovering new and exciting things. They encounter various cartoon characters and participate in their adventures, creating a hilarious and unexpected storyline"
+tech_ceos1,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this first image?,"This person is Steve Jobs, the co-founder and CEO of Apple."
+tech_ceos1,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this second image?,"This person is Bill Gates, the co-founder of Microsoft."
+tech_ceos1,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this third image?,"This person is Elon Musk, the CEO of Tesla and SpaceX."
+tech_ceos1,Recall who is in the second image.,Recall who is in the second image.,"The person in the second image is Bill Gates, the co-founder of Microsoft."
+tech_ceos1,Recall who is in the first image.,Recall who is in the first image.,"The person in the first image is Steve Jobs, the co-founder of Apple."
+tech_ceos1,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Yes, the person in the first image is Steve Jobs, the founder of Apple."
+tech_ceos1,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"No, the person in the third image is Elon Musk, the founder of Tesla and SpaceX."
+tech_ceos2,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in the first image?,"The person in the first image is Elon Musk, the CEO of Tesla."
+tech_ceos2,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in the second image?,"The person in the second image is Bill Gates, the co-founder of Microsoft."
+tech_ceos2,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in the third image?,"The person in the third image is Elon Musk, the CEO of Tesla."
+tech_ceos2,Recall who is in the second image.,Recall who is in the second image.,"The person in the second image is Bill Gates, the co-founder of Microsoft."
+tech_ceos2,Recall who is in the first image.,Recall who is in the first image.,"The person in the first image is Elon Musk, the CEO of Tesla."
+tech_ceos2,"Is the person in the first image the founder of Apple? If not, which of the above images is the person the founder of Apple. ","Is the person in the first image the founder of Apple? If not, which of the above images is the person the founder of Apple. ",The person in the first image is not the founder of Apple. The person in the second image is the founder of Microsoft.
+tech_ceos2,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"No, the person in the third image is not the founder of Apple."
diff --git a/applications/DeepSpeed-VisualChat/eval/results/eval_robustness/ours-set1_best_eval.csv b/applications/DeepSpeed-VisualChat/eval/results/eval_robustness/ours-set1_best_eval.csv
new file mode 100644
index 000000000..2d42db0cf
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/eval/results/eval_robustness/ours-set1_best_eval.csv
@@ -0,0 +1,61 @@
+test_name,image_path,question,answer
+tech_ceos2.1a,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in the image?,"The person in the image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1a,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in the image?,"The person in the image is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1a,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in the image?,"This person in the image is a man named Bill Gates, who is the founder of Microsoft."
+tech_ceos2.1a,Recall who is in the second image.,Recall who is in the second image.,"The second image shows a man holding a microphone and standing in front of a large sign with the word ""Tesla"" on it."
+tech_ceos2.1a,Recall who is in the first image.,Recall who is in the first image.,The first image shows Bill Gates holding a cell phone.
+tech_ceos2.1a,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","The person in the first image is not the founder of Apple, but the founder of Microsoft, Bill Gates."
+tech_ceos2.1a,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"Yes, the person in the third image is the founder of Apple, Steve Jobs."
+tech_ceos2.1b,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this image?,"The person in this image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1b,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this image?,"This person is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1b,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this image?,"This person is Elon Musk, the CEO of Tesla."
+tech_ceos2.1b,Recall who is in the second image.,Recall who is in the second image.,Elon Musk is in the second image holding a microphone.
+tech_ceos2.1b,Recall who is in the first image.,Recall who is in the first image.,Bill Gates is in the first image standing next to a Microsoft sign.
+tech_ceos2.1b,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Bill Gates is the founder of Microsoft, not Apple."
+tech_ceos2.1b,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"Yes, the person in the third image is the founder of Apple, Steve Jobs."
+tech_ceos2.1c,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this image a?,"The person in this image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1c,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this image b?,This person in this image b is Bill Gates.
+tech_ceos2.1c,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this image c?,"This person in this image c is a man wearing a suit and tie, and he is holding a microphone."
+tech_ceos2.1c,Recall who is in image b.,Recall who is in image b.,The person in image b is a man wearing a suit and tie holding a microphone.
+tech_ceos2.1c,Recall who is in the image a.,Recall who is in the image a.,The person in image a is Steve Jobs.
+tech_ceos2.1c,"Is the person in the image a the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the image a the founder of Apple? if not, which of the above images is the person the founder of Apple.","Yes, the person in the image a is the founder of Apple."
+tech_ceos2.1c,Is the person in the image c the founder of Apple?,Is the person in the image c the founder of Apple?,"No, the person in the image c is not the founder of Apple."
+tech_ceos2.1d,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this first image?,"The person in the first image is Steve Jobs, the CEO of Apple."
+tech_ceos2.1d,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this second image?,"This person is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1d,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this third image?,"This person is a man wearing a suit and tie, and he is holding a microphone."
+tech_ceos2.1d,Recall who is in the second image.,Recall who is in the second image.,The second image shows a man holding a microphone in front of a large sign.
+tech_ceos2.1d,Recall who is in the first image.,Recall who is in the first image.,The first image shows Steve Jobs standing in front of a podium.
+tech_ceos2.1d,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Yes, the person in the first image is the founder of Apple."
+tech_ceos2.1d,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"No, the person in the third image is not the founder of Apple."
+tech_ceos2.1aa,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in the image?,"The person in the image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1aa,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in the image?,"The person in the image is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1aa,What's the differnce between the first and second image,What's the differnce between the first and second image,The first image shows Steve Jobs holding an iPhone while the second image shows Bill Gates standing at a podium.
+tech_ceos2.1aa,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in the image?,"This person in the image is a man wearing a suit and tie, and he is holding a microphone."
+tech_ceos2.1aa,Recall who is in the second image.,Recall who is in the second image.,The second image shows a man holding a microphone and standing in front of a Tesla car.
+tech_ceos2.1aa,Recall who is in the first image.,Recall who is in the first image.,The first image shows Steve Jobs holding an iPhone.
+tech_ceos2.1aa,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","The person in the first image is the founder of Apple, Steve Jobs."
+tech_ceos2.1aa,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"Yes, the person in the third image is the founder of Apple, Elon Musk."
+tech_ceos2.1bb,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this image?,"The person in this image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1bb,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this image?,"This person is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1bb,What's the differnce between the first and second images,What's the differnce between the first and second images,The first image shows Steve Jobs holding an iPhone while the second image shows Bill Gates standing at a podium.
+tech_ceos2.1bb,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this image?,"This person is Elon Musk, the CEO of Tesla."
+tech_ceos2.1bb,Recall who is in the second image.,Recall who is in the second image.,Elon Musk is in the second image holding a microphone.
+tech_ceos2.1bb,Recall who is in the first image.,Recall who is in the first image.,Steve Jobs is in the first image holding an iPhone.
+tech_ceos2.1bb,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Steve Jobs is the founder of Apple, and he is in the first image holding an iPhone."
+tech_ceos2.1bb,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"Yes, the person in the third image is the founder of Apple, Steve Jobs."
+tech_ceos2.1cc,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this image a?,"The person in this image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1cc,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this image b?,This person in this image b is Bill Gates.
+tech_ceos2.1cc,What's the differnce between the image a and image b,What's the differnce between the image a and image b,"In image a, Steve Jobs is holding an iPhone while in image b, Bill Gates is standing behind a podium."
+tech_ceos2.1cc,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this image c?,"This person in this image c is a man wearing a suit and tie, and he is holding an iPhone in his hand."
+tech_ceos2.1cc,Recall who is in image b.,Recall who is in image b.,The person in image b is a man wearing a suit and tie.
+tech_ceos2.1cc,Recall who is in the image a.,Recall who is in the image a.,Steve Jobs is in the image a.
+tech_ceos2.1cc,"Is the person in the image a the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the image a the founder of Apple? if not, which of the above images is the person the founder of Apple.","The person in the image c is the founder of Apple, while the person in the image a is the CEO of Apple."
+tech_ceos2.1cc,Is the person in the image c the founder of Apple?,Is the person in the image c the founder of Apple?,"Yes, the person in the image c is the founder of Apple."
+tech_ceos2.1dd,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this first image?,"The person in the first image is Steve Jobs, the CEO of Apple."
+tech_ceos2.1dd,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this second image?,"This person is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1dd,What's the differnce between the first and second images,What's the differnce between the first and second images,The first image is of Steve Jobs holding an iPhone while the second image is of Bill Gates standing at a podium.
+tech_ceos2.1dd,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this third image?,"This person is a man wearing a suit and tie, and he is holding an iPhone in his hand."
+tech_ceos2.1dd,Recall who is in the second image.,Recall who is in the second image.,The second image shows a man holding an iPhone in his hand.
+tech_ceos2.1dd,Recall who is in the first image.,Recall who is in the first image.,The first image shows Steve Jobs holding an iPhone.
+tech_ceos2.1dd,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","The person in the first image is the founder of Apple, Steve Jobs."
+tech_ceos2.1dd,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"Yes, the person in the third image is the founder of Apple, Steve Jobs."
diff --git a/applications/DeepSpeed-VisualChat/eval/results/eval_robustness/ours-set1_final.csv b/applications/DeepSpeed-VisualChat/eval/results/eval_robustness/ours-set1_final.csv
new file mode 100644
index 000000000..e717653d4
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/eval/results/eval_robustness/ours-set1_final.csv
@@ -0,0 +1,61 @@
+test_name,image_path,question,answer
+tech_ceos2.1a,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in the image?,"The person in the image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1a,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in the image?,"This person in the image is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1a,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in the image?,"This person in the image is a man named Bill Gates, who is the co-founder of Microsoft."
+tech_ceos2.1a,Recall who is in the second image.,Recall who is in the second image.,The second image shows a man named Bill Gates holding a microphone.
+tech_ceos2.1a,Recall who is in the first image.,Recall who is in the first image.,The first image shows Bill Gates holding a cell phone.
+tech_ceos2.1a,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","The person in the first image is the founder of Microsoft, Bill Gates. The person in the second image is also the founder of Microsoft, Bill Gates."
+tech_ceos2.1a,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"No, the person in the third image is not the founder of Apple. The person in the third image is a man named Bill Gates, who is the founder of Microsoft."
+tech_ceos2.1b,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this image?,"The person in this image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1b,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this image?,"This person is the famous entrepreneur and former Apple CEO, Steve Jobs."
+tech_ceos2.1b,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this image?,"This person is Elon Musk, the CEO of Tesla."
+tech_ceos2.1b,Recall who is in the second image.,Recall who is in the second image.,Elon Musk is in the second image holding a microphone.
+tech_ceos2.1b,Recall who is in the first image.,Recall who is in the first image.,Steve Jobs is in the first image standing behind a podium.
+tech_ceos2.1b,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.",Steve Jobs is the founder of Apple and is in the first image standing behind a podium.
+tech_ceos2.1b,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"Yes, the person in the third image is Elon Musk, the founder of Tesla and SpaceX."
+tech_ceos2.1c,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this image a?,"This person in the image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1c,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this image b?,This person in the image b is Bill Gates.
+tech_ceos2.1c,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this image c?,"This person in this image c is a man wearing a suit and tie, and he is holding a microphone."
+tech_ceos2.1c,Recall who is in image b.,Recall who is in image b.,The person in image b is a man holding a microphone.
+tech_ceos2.1c,Recall who is in the image a.,Recall who is in the image a.,The person in image a is Steve Jobs.
+tech_ceos2.1c,"Is the person in the image a the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the image a the founder of Apple? if not, which of the above images is the person the founder of Apple.","Yes, the person in the image a is the founder of Apple."
+tech_ceos2.1c,Is the person in the image c the founder of Apple?,Is the person in the image c the founder of Apple?,"No, the person in the image c is not the founder of Apple."
+tech_ceos2.1d,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this first image?,"The person in the first image is Steve Jobs, the CEO of Apple."
+tech_ceos2.1d,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this second image?,"This person is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1d,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this third image?,"This person is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1d,Recall who is in the second image.,Recall who is in the second image.,"The second image shows a man holding a microphone and standing in front of a large sign that says ""Toshiba."""
+tech_ceos2.1d,Recall who is in the first image.,Recall who is in the first image.,The first image shows Steve Jobs holding an iPhone.
+tech_ceos2.1d,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Yes, the person in the first image is the founder of Apple, Steve Jobs."
+tech_ceos2.1d,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"No, the person in the third image is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1aa,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in the image?,"The person in the image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1aa,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in the image?,"This person in the image is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1aa,What's the differnce between the first and second image,What's the differnce between the first and second image,The first image is of Steve Jobs holding an iPhone while the second image is of Bill Gates standing at a podium.
+tech_ceos2.1aa,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in the image?,"This person in the image is a man named Elon Musk, the founder and CEO of Tesla."
+tech_ceos2.1aa,Recall who is in the second image.,Recall who is in the second image.,"In the second image, there are two people, one of whom is Elon Musk holding a microphone."
+tech_ceos2.1aa,Recall who is in the first image.,Recall who is in the first image.,"In the first image, there is only one person, Steve Jobs, holding an iPhone."
+tech_ceos2.1aa,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","The person in the first image is the founder of Apple, Steve Jobs."
+tech_ceos2.1aa,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"Yes, the person in the third image is the founder of Apple, Steve Jobs."
+tech_ceos2.1bb,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this image?,"The person in this image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1bb,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this image?,"This person is the famous entrepreneur and former Apple CEO, Steve Jobs."
+tech_ceos2.1bb,What's the differnce between the first and second images,What's the differnce between the first and second images,"In the first image, Steve Jobs is holding up an iPhone while standing, while in the second image, he is sitting behind a podium."
+tech_ceos2.1bb,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this image?,"This person is Elon Musk, the CEO of Tesla."
+tech_ceos2.1bb,Recall who is in the second image.,Recall who is in the second image.,"Elon Musk is in the second image, holding a microphone and standing behind a podium."
+tech_ceos2.1bb,Recall who is in the first image.,Recall who is in the first image.,"Steve Jobs is in the first image, holding up an iPhone."
+tech_ceos2.1bb,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Steve Jobs is the founder of Apple, and he is in the first image holding up an iPhone."
+tech_ceos2.1bb,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"Yes, the person in the third image is Steve Jobs, the founder of Apple."
+tech_ceos2.1cc,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this image a?,"This person in the image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1cc,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this image b?,This person in the image b is Bill Gates.
+tech_ceos2.1cc,What's the differnce between the image a and image b,What's the differnce between the image a and image b,"In image a, Steve Jobs is holding an iPhone while in image b, Bill Gates is standing behind a microphone."
+tech_ceos2.1cc,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this image c?,"This person in this image c is a man wearing a suit and tie, and he is holding a microphone."
+tech_ceos2.1cc,Recall who is in image b.,Recall who is in image b.,The person in image b is a man wearing a suit and tie holding a microphone.
+tech_ceos2.1cc,Recall who is in the image a.,Recall who is in the image a.,Steve Jobs is in the image a holding an iPhone.
+tech_ceos2.1cc,"Is the person in the image a the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the image a the founder of Apple? if not, which of the above images is the person the founder of Apple.","Yes, the person in the image a is the founder of Apple."
+tech_ceos2.1cc,Is the person in the image c the founder of Apple?,Is the person in the image c the founder of Apple?,"No, the person in the image c is not the founder of Apple."
+tech_ceos2.1dd,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this first image?,"The person in the first image is Steve Jobs, the CEO of Apple."
+tech_ceos2.1dd,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this second image?,"This person is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1dd,What's the differnce between the first and second images,What's the differnce between the first and second images,The first image is of Steve Jobs holding an iPhone while the second image is of Bill Gates standing at a podium.
+tech_ceos2.1dd,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this third image?,"This person is a man named Bill Gates, who is the founder of Microsoft."
+tech_ceos2.1dd,Recall who is in the second image.,Recall who is in the second image.,The second image shows a man holding an iPhone.
+tech_ceos2.1dd,Recall who is in the first image.,Recall who is in the first image.,The first image shows Steve Jobs holding an iPhone.
+tech_ceos2.1dd,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Yes, the person in the first image is the founder of Apple, Steve Jobs."
+tech_ceos2.1dd,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"Yes, the person in the third image is the founder of Apple, Bill Gates."
diff --git a/applications/DeepSpeed-VisualChat/eval/results/eval_robustness/ours-set2_best_eval.csv b/applications/DeepSpeed-VisualChat/eval/results/eval_robustness/ours-set2_best_eval.csv
new file mode 100644
index 000000000..1f77dfbe4
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/eval/results/eval_robustness/ours-set2_best_eval.csv
@@ -0,0 +1,61 @@
+test_name,image_path,question,answer
+tech_ceos2.1a,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in the image?,"This person in the image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1a,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in the image?,"This person in the image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1a,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in the image?,"This person in the image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1a,Recall who is in the second image.,Recall who is in the second image.,"In the second image, Steve Jobs is standing in front of a large red Tesla logo, holding a microphone and speaking to a crowd. He is wearing a black suit and tie, and his hair is gray."
+tech_ceos2.1a,Recall who is in the first image.,Recall who is in the first image.,"In the first image, Steve Jobs is holding up an iPhone, which is white on the front and has a black screen. He is wearing a black shirt and is standing in front of a white background."
+tech_ceos2.1a,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Yes, the person in the first image is Steve Jobs, the founder of Apple."
+tech_ceos2.1a,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"No, the person in the third image is not Steve Jobs, the founder of Apple."
+tech_ceos2.1b,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this image?,"This person is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1b,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this image?,"This person is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1b,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this image?,"This person is Elon Musk, the CEO of Tesla."
+tech_ceos2.1b,Recall who is in the second image.,Recall who is in the second image.,"In the second image, there is a man standing in front of a Tesla car. He is wearing a suit and tie, and he is holding a microphone."
+tech_ceos2.1b,Recall who is in the first image.,Recall who is in the first image.,"In the first image, there is a man standing in front of a Tesla car. He is wearing a black suit and tie, and he is holding a cell phone in his hand."
+tech_ceos2.1b,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.",The person in the first image is not the founder of Apple. The person in the second image is the founder of Apple.
+tech_ceos2.1b,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"Yes, the person in the third image is the founder of Apple."
+tech_ceos2.1c,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this image a?,"This person in this image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1c,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this image b?,"This person in this image b is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1c,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this image c?,"This person in this image c is Steve Jobs, the co-founder of Apple."
+tech_ceos2.1c,Recall who is in image b.,Recall who is in image b.,"In image b, there are several people, including a man in a suit and tie, a woman in a black dress, and a man in a suit and tie holding a microphone."
+tech_ceos2.1c,Recall who is in the image a.,Recall who is in the image a.,"In the image a, there is a man in a suit and tie standing in front of a podium."
+tech_ceos2.1c,"Is the person in the image a the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the image a the founder of Apple? if not, which of the above images is the person the founder of Apple.","Yes, the person in the image a is the founder of Apple, Steve Jobs."
+tech_ceos2.1c,Is the person in the image c the founder of Apple?,Is the person in the image c the founder of Apple?,"Yes, the person in the image c is the founder of Apple, Steve Jobs."
+tech_ceos2.1d,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this first image?,"This person is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1d,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this second image?,"This person is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1d,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this third image?,"This person is Elon Musk, the CEO of Tesla and SpaceX."
+tech_ceos2.1d,Recall who is in the second image.,Recall who is in the second image.,"The person in the second image is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1d,Recall who is in the first image.,Recall who is in the first image.,"The person in the first image is Steve Jobs, the co-founder of Apple."
+tech_ceos2.1d,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Yes, the person in the first image is Steve Jobs, the founder of Apple."
+tech_ceos2.1d,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"No, the person in the third image is Elon Musk, the founder of Tesla and SpaceX."
+tech_ceos2.1aa,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in the image?,"This person in the image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1aa,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in the image?,"This person in the image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1aa,What's the differnce between the first and second image,What's the differnce between the first and second image,"In the first image, Steve Jobs is holding up a cell phone while in the second image, he is standing at a podium."
+tech_ceos2.1aa,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in the image?,"This person in the image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1aa,Recall who is in the second image.,Recall who is in the second image.,"In the second image, Steve Jobs is standing at a podium and holding a microphone. He is wearing a black suit and tie, and his hair is gray. He is speaking to a crowd of people, possibly at a conference or event."
+tech_ceos2.1aa,Recall who is in the first image.,Recall who is in the first image.,"In the first image, Steve Jobs is holding up a cell phone, which is white on the front and has a black border. He is wearing a black shirt and is standing in front of a white background."
+tech_ceos2.1aa,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Yes, the person in the first image is Steve Jobs, the founder of Apple."
+tech_ceos2.1aa,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"No, the person in the third image is not Steve Jobs, the founder of Apple."
+tech_ceos2.1bb,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this image?,"This person is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1bb,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this image?,"This person is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1bb,What's the differnce between the first and second images,What's the differnce between the first and second images,The first image is of Steve Jobs holding up an iPhone while the second image is of Bill Gates holding up a cell phone.
+tech_ceos2.1bb,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this image?,"This person is Steve Jobs, the co-founder of Apple."
+tech_ceos2.1bb,Recall who is in the second image.,Recall who is in the second image.,"The second image shows Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1bb,Recall who is in the first image.,Recall who is in the first image.,"The first image shows Steve Jobs, the co-founder of Apple, holding up an iPhone."
+tech_ceos2.1bb,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","The person in the first image is Steve Jobs, the founder of Apple."
+tech_ceos2.1bb,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"No, the person in the third image is Bill Gates, the founder of Microsoft."
+tech_ceos2.1cc,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this image a?,"This person in this image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1cc,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this image b?,"This person in this image b is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1cc,What's the differnce between the image a and image b,What's the differnce between the image a and image b,The image a is of Steve Jobs holding a cell phone while the image b is of Bill Gates standing in front of a podium.
+tech_ceos2.1cc,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this image c?,"This person in this image c is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1cc,Recall who is in image b.,Recall who is in image b.,"The person in image b is Steve Jobs, the co-founder of Apple."
+tech_ceos2.1cc,Recall who is in the image a.,Recall who is in the image a.,"The person in the image a is Steve Jobs, the co-founder of Apple."
+tech_ceos2.1cc,"Is the person in the image a the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the image a the founder of Apple? if not, which of the above images is the person the founder of Apple.","Yes, the person in the image a is the founder of Apple."
+tech_ceos2.1cc,Is the person in the image c the founder of Apple?,Is the person in the image c the founder of Apple?,"No, the person in the image c is not the founder of Apple. He is the co-founder of Microsoft."
+tech_ceos2.1dd,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this first image?,"This person is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1dd,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this second image?,"This person is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1dd,What's the differnce between the first and second images,What's the differnce between the first and second images,The first image is of Steve Jobs holding up an iPhone while the second image is of Bill Gates standing at a podium.
+tech_ceos2.1dd,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this third image?,"This person is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1dd,Recall who is in the second image.,Recall who is in the second image.,The second image is of Bill Gates standing at a podium.
+tech_ceos2.1dd,Recall who is in the first image.,Recall who is in the first image.,The first image is of Steve Jobs holding up an iPhone.
+tech_ceos2.1dd,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","The person in the first image is Steve Jobs, the founder of Apple."
+tech_ceos2.1dd,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"Yes, the person in the third image is Bill Gates, the co-founder of Microsoft."
diff --git a/applications/DeepSpeed-VisualChat/eval/results/eval_robustness/ours-set2_final.csv b/applications/DeepSpeed-VisualChat/eval/results/eval_robustness/ours-set2_final.csv
new file mode 100644
index 000000000..2060fa157
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/eval/results/eval_robustness/ours-set2_final.csv
@@ -0,0 +1,61 @@
+test_name,image_path,question,answer
+tech_ceos2.1a,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in the image?,"This person in the image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1a,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in the image?,"This person in the image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1a,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in the image?,"This person in the image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1a,Recall who is in the second image.,Recall who is in the second image.,"In the second image, Steve Jobs is standing in front of a large red Apple logo, holding a microphone and speaking to a crowd. He is wearing a black suit and tie, and his hair is gray."
+tech_ceos2.1a,Recall who is in the first image.,Recall who is in the first image.,"In the first image, Steve Jobs is holding up an iPhone, which is white on the front and back, and has a black screen. He is wearing a black shirt and is standing in front of a white background."
+tech_ceos2.1a,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Yes, the person in the first image is Steve Jobs, the founder of Apple."
+tech_ceos2.1a,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"No, the person in the third image is not Steve Jobs, the founder of Apple."
+tech_ceos2.1b,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this image?,"This person is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1b,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this image?,"This person is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1b,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this image?,"This person is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1b,Recall who is in the second image.,Recall who is in the second image.,"In the second image, Steve Jobs is standing in front of a large red banner with the Apple logo on it. He is holding a microphone and appears to be giving a speech or presentation."
+tech_ceos2.1b,Recall who is in the first image.,Recall who is in the first image.,"In the first image, Steve Jobs is holding up an iPhone in front of a crowd of people. He is wearing a black turtleneck, jeans, and glasses, and is smiling."
+tech_ceos2.1b,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Yes, the person in the first image is Steve Jobs, the founder of Apple."
+tech_ceos2.1b,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"Yes, the person in the third image is Steve Jobs, the founder of Apple."
+tech_ceos2.1c,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this image a?,"This person in the image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1c,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this image b?,"This person in the image b is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1c,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this image c?,"This person in the image c is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1c,Recall who is in image b.,Recall who is in image b.,"In image b, Steve Jobs is standing in front of a microphone and a podium, wearing a black turtleneck and jeans. He is holding a microphone and appears to be giving a speech or presentation."
+tech_ceos2.1c,Recall who is in the image a.,Recall who is in the image a.,"In the image a, Steve Jobs is standing in front of a podium, wearing a suit and tie, and holding a microphone. He is surrounded by people, and the image appears to be from a news conference or event."
+tech_ceos2.1c,"Is the person in the image a the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the image a the founder of Apple? if not, which of the above images is the person the founder of Apple.","Yes, the person in the image a is the founder of Apple."
+tech_ceos2.1c,Is the person in the image c the founder of Apple?,Is the person in the image c the founder of Apple?,"Yes, the person in the image c is the founder of Apple."
+tech_ceos2.1d,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this first image?,"This person is Steve Jobs, the co-founder and CEO of Apple."
+tech_ceos2.1d,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this second image?,"This person is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1d,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this third image?,"This person is Elon Musk, the CEO of Tesla and SpaceX."
+tech_ceos2.1d,Recall who is in the second image.,Recall who is in the second image.,"The person in the second image is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1d,Recall who is in the first image.,Recall who is in the first image.,"The person in the first image is Steve Jobs, the co-founder of Apple."
+tech_ceos2.1d,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Yes, the person in the first image is Steve Jobs, the founder of Apple."
+tech_ceos2.1d,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"No, the person in the third image is Elon Musk, the founder of Tesla and SpaceX."
+tech_ceos2.1aa,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in the image?,"This person in the image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1aa,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in the image?,"This person in the image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1aa,What's the differnce between the first and second image,What's the differnce between the first and second image,"In the first image, Steve Jobs is holding a cell phone while in the second image, he is standing at a podium."
+tech_ceos2.1aa,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in the image?,"This person in the image is Elon Musk, the CEO of Tesla and SpaceX."
+tech_ceos2.1aa,Recall who is in the second image.,Recall who is in the second image.,"The second image shows Elon Musk standing in front of a Tesla car, holding a microphone and speaking to a crowd."
+tech_ceos2.1aa,Recall who is in the first image.,Recall who is in the first image.,The first image shows Steve Jobs holding a cell phone while standing in front of a microphone.
+tech_ceos2.1aa,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","The person in the first image is Steve Jobs, the founder of Apple."
+tech_ceos2.1aa,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"No, the person in the third image is Elon Musk, the founder of Tesla and SpaceX."
+tech_ceos2.1bb,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this image?,"This person is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1bb,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this image?,"This person is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1bb,What's the differnce between the first and second images,What's the differnce between the first and second images,"In the first image, Steve Jobs is holding up a cell phone while in the second image, he is standing at a podium."
+tech_ceos2.1bb,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this image?,"This person is Elon Musk, the CEO of Tesla and SpaceX."
+tech_ceos2.1bb,Recall who is in the second image.,Recall who is in the second image.,"In the second image, Steve Jobs is holding up a cell phone, which is likely an iPhone, and is standing at a podium."
+tech_ceos2.1bb,Recall who is in the first image.,Recall who is in the first image.,"In the first image, Steve Jobs is holding up a cell phone, which is likely an iPhone, and is wearing a black turtleneck and jeans."
+tech_ceos2.1bb,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Yes, the person in the first image is the founder of Apple."
+tech_ceos2.1bb,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"No, the person in the third image is Elon Musk, the CEO of Tesla and SpaceX."
+tech_ceos2.1cc,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this image a?,"This person in the image is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1cc,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this image b?,"This person in the image b is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1cc,What's the differnce between the image a and image b,What's the differnce between the image a and image b,"In image a, Steve Jobs is holding up a cell phone while in image b, he is standing at a podium and holding a microphone."
+tech_ceos2.1cc,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this image c?,"This person in the image c is Steve Jobs, the co-founder and former CEO of Apple."
+tech_ceos2.1cc,Recall who is in image b.,Recall who is in image b.,"In image b, Steve Jobs is standing at a podium and holding a microphone. He is wearing a black turtleneck and jeans, and he appears to be speaking to a crowd."
+tech_ceos2.1cc,Recall who is in the image a.,Recall who is in the image a.,"In the image a, Steve Jobs is holding up a cell phone, which is a black iPhone. He is wearing a black shirt and glasses, and his hair is gray."
+tech_ceos2.1cc,"Is the person in the image a the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the image a the founder of Apple? if not, which of the above images is the person the founder of Apple.","Yes, the person in the image a is the founder of Apple."
+tech_ceos2.1cc,Is the person in the image c the founder of Apple?,Is the person in the image c the founder of Apple?,"Yes, the person in the image c is the founder of Apple."
+tech_ceos2.1dd,./eval/eval_data/images/tech-ceo/jobs1.jpg,Who is this person in this first image?,"This person is Steve Jobs, the co-founder and CEO of Apple."
+tech_ceos2.1dd,./eval/eval_data/images/tech-ceo/gate1.jpg,Who is this person in this second image?,"This person is Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1dd,What's the differnce between the first and second images,What's the differnce between the first and second images,The first image is of Steve Jobs holding up an iPhone while the second image is of Bill Gates standing at a podium.
+tech_ceos2.1dd,./eval/eval_data/images/tech-ceo/musk1.jpg,Who is this person in this third image?,"This person is Elon Musk, the CEO of Tesla and SpaceX."
+tech_ceos2.1dd,Recall who is in the second image.,Recall who is in the second image.,"The second image is of Bill Gates, the co-founder of Microsoft."
+tech_ceos2.1dd,Recall who is in the first image.,Recall who is in the first image.,"The first image is of Steve Jobs, the co-founder of Apple."
+tech_ceos2.1dd,"Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","Is the person in the first image the founder of Apple? if not, which of the above images is the person the founder of Apple.","The person in the first image is Steve Jobs, the founder of Apple."
+tech_ceos2.1dd,Is the person in the third image the founder of Apple?,Is the person in the third image the founder of Apple?,"No, the person in the third image is Elon Musk, the founder of Tesla and SpaceX."
diff --git a/applications/DeepSpeed-VisualChat/eval/results/eval_single/ours-single_best_eval.csv b/applications/DeepSpeed-VisualChat/eval/results/eval_single/ours-single_best_eval.csv
new file mode 100644
index 000000000..a155b18cf
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/eval/results/eval_single/ours-single_best_eval.csv
@@ -0,0 +1,24 @@
+test_name,image_path,question,answer
+cat_images1,./eval/eval_data/images/cats/cat.png,please describe the image,"The image features a small kitten standing on top of a pile of green plants, with its paws on a leaf. The kitten appears to be curious and playful, as it is reaching up to touch the leaf with its paws. The plants are spread out across the ground, creating a lush and vibrant scene. The kitten's position on the plants adds a sense of adventure and fun to the image."
+cat_images2,./eval/eval_data/images/cats/1806905748_adb926a0a0.jpg,can you describe the image,"The image features a black cat lying on a carpeted floor, with its eyes closed and its paws stretched out. The cat appears to be relaxed and comfortable in its position. The carpeted floor is a neutral background, allowing the cat to be the main focus of the image."
+cat_images3,./eval/eval_data/images/cats/british_shorthair.jpg,please describe the image,"The image features a large gray and white cat lying on a carpeted floor. The cat is stretched out and appears to be relaxing, with its paws and legs spread out. The cat's body is positioned diagonally across the floor, taking up a significant portion of the space. The carpeted floor is a light color, providing a comfortable and cozy environment for the cat to rest in."
+extreme_ironing,./eval/eval_data/images/singles/extreme_ironing.jpg,What is unusual about this image?,"The unusual aspect of this image is that a man is standing on top of a car, ironing a shirt. This is an unconventional and potentially dangerous activity, as it puts the man at risk of falling off the car and potentially injuring himself or causing damage to the car. Ironing a shirt on top of a car is not a common practice, and it is not recommended as it poses safety hazards and can lead to accidents."
+waterview,./eval/eval_data/images/singles/waterview.jpg,What are the things I should be cautious about when I visit here?,"When visiting the wooden pier over the lake, there are a few things to be cautious about:
+
+1. Safety: Be aware of your surroundings and the condition of the pier. Make sure the pier is stable and secure before walking on it.
+
+2. Weather: Check the weather forecast before visiting to avoid any sudden changes or storms that could make the pier unsafe.
+
+3. Water depth: Be cautious of the water depth around the pier, especially if you plan to swim or wade in the lake.
+
+4. Wildlife: Be mindful of any wildlife in the area, such as fish or birds, and avoid disturbing their natural habitat.
+
+5. Personal belongings: Keep your belongings secure and close to you, as theft or loss can occur in public places.
+
+6. Sun protection: If visiting during daytime, remember to apply sunscreen and wear a hat to protect yourself from the sun's harmful rays.
+
+By being aware of these factors, you can ensure a safe and enjoyable visit to the wooden pier over the lake."
+art-dog,./eval/eval_data/images/singles/202160027_b319c4166e.jpg,can you describe the image,"The image features a large, colorful flower sculpture in the shape of a dog, standing in front of a building. The flower sculpture is made up of various flowers, including pink, red, and yellow flowers, which are attached to the dog's body. The dog sculpture is quite tall, and it is positioned in a way that it appears to be walking or standing on a sidewalk. The building in the background is a large, blue structure, which adds contrast to the vibrant flower sculpture."
+funny-phone,./eval/eval_data/images/singles/1.jpg,What is funny about this image? Describe it panel by panel.,"The image is a series of four panels that show a cell phone with a blue cord attached to it. The first panel displays the cell phone with the cord, while the second panel shows the cord being cut off. In the third panel, the cord is reattached to the cell phone, and in the fourth panel, the cord is cut off again. This sequence of events is humorous because it depicts a person repeatedly cutting and reattaching the cord to the cell phone, which is an unusual and unnecessary action. The image conveys a sense of absurdity and humor, as it highlights the pointlessness of the action."
+squirrel,./eval/eval_data/images/singles/2.jpg,Why would a person find this image funny?,"The image of a squirrel holding a camera and smiling is humorous because it is an unexpected and amusing combination of elements. Squirrels are typically associated with natural environments, such as parks and forests, and are not typically known for engaging in human activities like taking photographs. The fact that the squirrel is holding a camera and smiling, as if it is enjoying the process, adds a playful and lighthearted touch to the image. This combination of elements creates a sense of humor and amusement for viewers, as it challenges their expectations and imagination."
+art-painting,./eval/eval_data/images/singles/50.jpg,Tell me about this work of art.,"The work of art is a painting that features a scene of a man walking down a street at night, with a church in the background. The man is carrying a backpack, and the street is illuminated by the moonlight. The painting has a unique style, with the moon appearing as a large, yellow sun in the sky. This artistic choice adds a surreal and dreamlike quality to the scene, making it appear as if the man is walking through a fantasy world. The painting is a beautiful representation of a nighttime scene, capturing the tranquility and serenity of the moment."
diff --git a/applications/DeepSpeed-VisualChat/eval/results/eval_single/ours-single_final.csv b/applications/DeepSpeed-VisualChat/eval/results/eval_single/ours-single_final.csv
new file mode 100644
index 000000000..2b72b4814
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/eval/results/eval_single/ours-single_final.csv
@@ -0,0 +1,16 @@
+test_name,image_path,question,answer
+cat_images1,./eval/eval_data/images/cats/cat.png,please describe the image,"The image shows a close-up of a small kitten standing in a field of green plants. The kitten is looking upwards with a curious expression on its face, surrounded by the lush vegetation. The plants around the kitten come in various sizes and shapes, creating a picturesque scene of nature."
+cat_images2,./eval/eval_data/images/cats/1806905748_adb926a0a0.jpg,can you describe the image,"The image features a black cat lying on the floor, specifically on a yellow area, possibly a rug. The cat appears to be relaxed and laying on its side, with its eyes closed. The cat occupies a significant portion of the frame, drawing attention to its presence."
+cat_images3,./eval/eval_data/images/cats/british_shorthair.jpg,please describe the image,"The image features a gray and white cat lying on the floor next to a piece of carpeting. The cat is stretched out, with its body resting on the floor and one paw reaching up onto the carpet. The cat appears to be relaxed and enjoying its rest."
+extreme_ironing,./eval/eval_data/images/singles/extreme_ironing.jpg,What is unusual about this image?,"In this image, it is unusual to see a man doing laundry by hanging clothes on a clothes rack attached to a parked yellow taxi in the middle of the street. Typically, laundry would be done in a more designated or safe area, such as a laundromat or a backyard, to avoid potential hazards and obstructing traffic. The man's choice to do laundry in this manner raises concerns about safety and the flow of traffic, as it could cause distractions for drivers and pedestrians and pose a risk to both the man and others in the area."
+waterview,./eval/eval_data/images/singles/waterview.jpg,What are the things I should be cautious about when I visit here?,"When visiting this picturesque dock over a calm lake surrounded by trees, there are a few things to keep in mind to ensure a safe and enjoyable experience. First, be aware of the water conditions and always use caution when entering the lake, as it may have hidden obstacles or sudden changes in depth. Second, make sure to wear appropriate footwear to prevent slipping or falling off the dock. Third, be mindful of the weather and avoid visiting during stormy conditions, as lightning can be dangerous. Lastly, practice water safety rules, such as swimming with a buddy, staying within your swimming abilities, and not venturing too far from the shore or dock. Enjoy the serene atmosphere and the beautiful view of the mountains in the background while being cautious and respectful of the environment."
+art-dog,./eval/eval_data/images/singles/202160027_b319c4166e.jpg,can you describe the image,"The image features a large, colorful clock tower in the background, with a man standing in front of it. In the foreground, there is a unique sculpture of a dog made out of flowers, which is the main focal point of the scene. The sculpture covers a significant portion of the image, and its vibrant colors add an artistic touch to the scene. The clock tower and the man in the background provide context and a sense of place, while the flower dog sculpture captures the viewer's attention with its creativity and beauty."
+funny-phone,./eval/eval_data/images/singles/1.jpg,What is funny about this image? Describe it panel by panel.,"The image is a collage of three panels, each showing different aspects of the humor in the scene.
+
+Panel 1: A cell phone is being held by someone, with its charger still attached to a blue block. The person seems to be trying to remove the charger from the block, possibly out of frustration or curiosity.
+
+Panel 2: A close-up of the blue block, which appears to be a deactivated electronic device adapter or a device blocker, is shown. The person might be trying to figure out why the charger won't fit or if the blue block is preventing it from working.
+
+Panel 3: The cell phone is now disconnected from the charger, and the person seems to have successfully detached it from the blue block. The situation is humorous as it depicts a common everyday activity, like charging a phone, with an added element of confusion and challenge due to the presence of the blue block."
+squirrel,./eval/eval_data/images/singles/2.jpg,Why would a person find this image funny?,"A person might find this image funny because it features a playful depiction of a squirrel holding a camera. Squirrels are typically seen as small, cute, and harmless animals, and their interaction with human objects, such as a camera, adds an element of humor to the scene. The idea of a squirrel being interested in photography or posing with a camera can evoke amusement and lightheartedness, as it is an unexpected and humorous twist on the usual behavior of squirrels. This whimsical combination of a familiar animal and an object associated with human activities can elicit laughter and amusement."
+art-painting,./eval/eval_data/images/singles/50.jpg,Tell me about this work of art.,"The work of art is a unique combination of a painting and a print featuring the famous character from the comics, Spider-Man. In this piece, Spider-Man is depicted as the ""Sgt. Star Platinum"" character from the series ""The Devil Manga."" The painting or print has a vintage or old-school style, and it is set against a night sky with stars. The scene also includes a village and a house, which adds depth and context to the overall composition. This artwork is likely to appeal to fans of comics, street art, and visual storytelling."
diff --git a/applications/DeepSpeed-VisualChat/helper/README.md b/applications/DeepSpeed-VisualChat/helper/README.md
new file mode 100644
index 000000000..bd32b54e8
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/helper/README.md
@@ -0,0 +1,2 @@
+# QWen-VL's Vision Encoder
+The `extract_qwen_vl.py` script can be used to extract the vision encoder from QWen-VL. After extraction, the other necessary files (the model and preprocessor configs) can be found in the [qwen_clip folder](./qwen_clip).
\ No newline at end of file
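+
+To sanity-check the extraction, a minimal sketch like the following (assuming `extract_qwen_vl.py` has already been run so that `./qwen_clip/pytorch_model.bin` exists) prints a few of the saved parameter names and shapes:
+
+```python
+import torch
+
+# Load the extracted vision-encoder weights on CPU and inspect them.
+state_dict = torch.load('./qwen_clip/pytorch_model.bin', map_location='cpu')
+print(f"number of tensors: {len(state_dict)}")
+for name, tensor in list(state_dict.items())[:5]:
+    print(name, tuple(tensor.shape))
+```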
diff --git a/applications/DeepSpeed-VisualChat/helper/extract_qwen_vl.py b/applications/DeepSpeed-VisualChat/helper/extract_qwen_vl.py
new file mode 100644
index 000000000..12808b87c
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/helper/extract_qwen_vl.py
@@ -0,0 +1,14 @@
+from transformers import AutoModelForCausalLM
+import torch
+
+PATH = "Qwen/Qwen-VL-Chat"
+
+# Load the full QWen-VL-Chat model; trust_remote_code is required because QWen-VL ships custom modeling code.
+model = AutoModelForCausalLM.from_pretrained(PATH, device_map="cuda", trust_remote_code=True).eval()
+
+# Keep only the vision-encoder weights and strip the "transformer.visual." prefix from their names.
+state_dict = model.state_dict()
+save_dict = {}
+for k, v in state_dict.items():
+    if 'visual' in k:
+        if 'transformer.visual.proj' not in k:  # we don't need the proj layer
+            save_dict[k.replace('transformer.visual.', '')] = v
+torch.save(save_dict, './qwen_clip/pytorch_model.bin')
\ No newline at end of file
diff --git a/applications/DeepSpeed-VisualChat/helper/qwen_clip/config.json b/applications/DeepSpeed-VisualChat/helper/qwen_clip/config.json
new file mode 100755
index 000000000..357bb8280
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/helper/qwen_clip/config.json
@@ -0,0 +1,180 @@
+{
+ "_commit_hash": null,
+ "architectures": [
+ "CLIPModel"
+ ],
+ "initializer_factor": 1.0,
+ "logit_scale_init_value": 2.6592,
+ "model_type": "clip",
+ "projection_dim": 1280,
+ "text_config": {
+ "_name_or_path": "",
+ "add_cross_attention": false,
+ "architectures": null,
+ "attention_dropout": 0.0,
+ "bad_words_ids": null,
+ "begin_suppress_tokens": null,
+ "bos_token_id": 0,
+ "chunk_size_feed_forward": 0,
+ "cross_attention_hidden_size": null,
+ "decoder_start_token_id": null,
+ "diversity_penalty": 0.0,
+ "do_sample": false,
+ "dropout": 0.0,
+ "early_stopping": false,
+ "encoder_no_repeat_ngram_size": 0,
+ "eos_token_id": 2,
+ "exponential_decay_length_penalty": null,
+ "finetuning_task": null,
+ "forced_bos_token_id": null,
+ "forced_eos_token_id": null,
+ "hidden_act": "gelu",
+ "hidden_size": 1280,
+ "id2label": {
+ "0": "LABEL_0",
+ "1": "LABEL_1"
+ },
+ "initializer_factor": 1.0,
+ "initializer_range": 0.02,
+ "intermediate_size": 5120,
+ "is_decoder": false,
+ "is_encoder_decoder": false,
+ "label2id": {
+ "LABEL_0": 0,
+ "LABEL_1": 1
+ },
+ "layer_norm_eps": 1e-05,
+ "length_penalty": 1.0,
+ "max_length": 20,
+ "max_position_embeddings": 77,
+ "min_length": 0,
+ "model_type": "clip_text_model",
+ "no_repeat_ngram_size": 0,
+ "num_attention_heads": 20,
+ "num_beam_groups": 1,
+ "num_beams": 1,
+ "num_hidden_layers": 32,
+ "num_return_sequences": 1,
+ "output_attentions": false,
+ "output_hidden_states": false,
+ "output_scores": false,
+ "pad_token_id": 1,
+ "prefix": null,
+ "problem_type": null,
+ "pruned_heads": {},
+ "remove_invalid_values": false,
+ "repetition_penalty": 1.0,
+ "return_dict": true,
+ "return_dict_in_generate": false,
+ "sep_token_id": null,
+ "suppress_tokens": null,
+ "task_specific_params": null,
+ "temperature": 1.0,
+ "tf_legacy_loss": false,
+ "tie_encoder_decoder": false,
+ "tie_word_embeddings": true,
+ "tokenizer_class": null,
+ "top_k": 50,
+ "top_p": 1.0,
+ "torch_dtype": null,
+ "torchscript": false,
+ "transformers_version": "4.24.0",
+ "typical_p": 1.0,
+ "use_bfloat16": false,
+ "vocab_size": 49408
+ },
+ "text_config_dict": {
+ "hidden_act": "gelu",
+ "hidden_size": 1280,
+ "intermediate_size": 5120,
+ "num_attention_heads": 20,
+ "num_hidden_layers": 32
+ },
+ "torch_dtype": "float32",
+ "transformers_version": null,
+ "vision_config": {
+ "_name_or_path": "",
+ "add_cross_attention": false,
+ "architectures": null,
+ "attention_dropout": 0.0,
+ "bad_words_ids": null,
+ "begin_suppress_tokens": null,
+ "bos_token_id": null,
+ "chunk_size_feed_forward": 0,
+ "cross_attention_hidden_size": null,
+ "decoder_start_token_id": null,
+ "diversity_penalty": 0.0,
+ "do_sample": false,
+ "dropout": 0.0,
+ "early_stopping": false,
+ "encoder_no_repeat_ngram_size": 0,
+ "eos_token_id": null,
+ "exponential_decay_length_penalty": null,
+ "finetuning_task": null,
+ "forced_bos_token_id": null,
+ "forced_eos_token_id": null,
+ "hidden_act": "gelu",
+ "hidden_size": 1664,
+ "id2label": {
+ "0": "LABEL_0",
+ "1": "LABEL_1"
+ },
+ "image_size": 224,
+ "initializer_factor": 1.0,
+ "initializer_range": 0.02,
+ "intermediate_size": 8192,
+ "is_decoder": false,
+ "is_encoder_decoder": false,
+ "label2id": {
+ "LABEL_0": 0,
+ "LABEL_1": 1
+ },
+ "layer_norm_eps": 1e-05,
+ "length_penalty": 1.0,
+ "max_length": 20,
+ "min_length": 0,
+ "model_type": "clip_vision_model",
+ "no_repeat_ngram_size": 0,
+ "num_attention_heads": 16,
+ "num_beam_groups": 1,
+ "num_beams": 1,
+ "num_channels": 3,
+ "num_hidden_layers": 48,
+ "num_return_sequences": 1,
+ "output_attentions": false,
+ "output_hidden_states": false,
+ "output_scores": false,
+ "pad_token_id": null,
+ "patch_size": 14,
+ "prefix": null,
+ "problem_type": null,
+ "pruned_heads": {},
+ "remove_invalid_values": false,
+ "repetition_penalty": 1.0,
+ "return_dict": true,
+ "return_dict_in_generate": false,
+ "sep_token_id": null,
+ "suppress_tokens": null,
+ "task_specific_params": null,
+ "temperature": 1.0,
+ "tf_legacy_loss": false,
+ "tie_encoder_decoder": false,
+ "tie_word_embeddings": true,
+ "tokenizer_class": null,
+ "top_k": 50,
+ "top_p": 1.0,
+ "torch_dtype": null,
+ "torchscript": false,
+ "transformers_version": "4.24.0",
+ "typical_p": 1.0,
+ "use_bfloat16": false
+ },
+ "vision_config_dict": {
+ "hidden_act": "gelu",
+ "hidden_size": 1664,
+ "intermediate_size": 8192,
+ "num_attention_heads": 16,
+ "num_hidden_layers": 48,
+ "patch_size": 14
+ }
+}
diff --git a/applications/DeepSpeed-VisualChat/helper/qwen_clip/preprocessor_config.json b/applications/DeepSpeed-VisualChat/helper/qwen_clip/preprocessor_config.json
new file mode 100755
index 000000000..89a5d584a
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/helper/qwen_clip/preprocessor_config.json
@@ -0,0 +1,19 @@
+{
+ "crop_size": 448,
+ "do_center_crop": true,
+ "do_normalize": true,
+ "do_resize": true,
+ "feature_extractor_type": "CLIPFeatureExtractor",
+ "image_mean": [
+ 0.48145466,
+ 0.4578275,
+ 0.40821073
+ ],
+ "image_std": [
+ 0.26862954,
+ 0.26130258,
+ 0.27577711
+ ],
+ "resample": 3,
+ "size": 448
+}
diff --git a/applications/DeepSpeed-VisualChat/requirements.txt b/applications/DeepSpeed-VisualChat/requirements.txt
new file mode 100644
index 000000000..ae0440617
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/requirements.txt
@@ -0,0 +1,11 @@
+datasets>=2.8.0
+sentencepiece>=0.1.97
+protobuf==3.20.3
+accelerate>=0.15.0
+open_clip_torch
+deepspeed>=0.10.3
+einops
+einops_exts
+transformers==4.33.3
+transformers_stream_generator
+termcolor
\ No newline at end of file
diff --git a/applications/DeepSpeed-VisualChat/training/README.md b/applications/DeepSpeed-VisualChat/training/README.md
new file mode 100644
index 000000000..dd4e58d26
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/training/README.md
@@ -0,0 +1,40 @@
+
+
+
+
+
+
+# Vision-Language Finetuning
+Vision-language finetuning (or instruction finetuning) is very similar to standard large language model finetuning. The differences are summarized as follows:
+* It requires an extra vision encoder.
+* It requires high-quality text-image data pairs.
+* It needs a linear projection layer that connects the visual encoder and the LLM (see the sketch after this list).
+
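+As a rough illustration of the last point, below is a minimal, self-contained sketch of such a projection layer (with hypothetical hidden sizes; it is not the exact module used by DeepSpeed-VisualChat):
+
+```python
+import torch
+import torch.nn as nn
+
+# Hypothetical dimensions, for illustration only.
+VISION_HIDDEN_SIZE = 1024   # e.g., a CLIP-large-style visual encoder
+LLM_HIDDEN_SIZE = 4096      # e.g., a LLaMA-2-7B-style decoder
+
+# A single linear layer maps visual-token features into the LLM embedding space,
+# so that projected image tokens can be interleaved with text-token embeddings.
+vis_proj = nn.Linear(VISION_HIDDEN_SIZE, LLM_HIDDEN_SIZE)
+
+vision_features = torch.randn(2, 256, VISION_HIDDEN_SIZE)   # (batch, image tokens, vision dim)
+projected = vis_proj(vision_features)                        # (batch, image tokens, LLM dim)
+print(projected.shape)  # torch.Size([2, 256, 4096])
+```
+
+This roughly corresponds to the `baseline` choice of the `--vis_proj` argument in `main.py`.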
+
+## How to train the model
+Remember to prepare your data first based on the [tutorial](../README.md). If you want to use 448x448-resolution images, you need to go to [helper](../helper) for more information. Otherwise, you can simply run
+```bash
+ training_scripts/run_7b.sh
+ ```
+
+which shows an example of fine-tuning a LLaMA-2-7B LLM together with a CLIP-large visual encoder and a linear projection layer that connects the visual encoder and the LLM.
+
+
+## A few examples
+Please refer to [Chat](../chat/README.md) for example conversations and to [Eval](../eval/README.md) for evaluation.
+
+
+
+
+
+-----
+
+
+
+
+
+------
+
+
+
+
diff --git a/applications/DeepSpeed-VisualChat/training/main.py b/applications/DeepSpeed-VisualChat/training/main.py
new file mode 100755
index 000000000..c345f9832
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/training/main.py
@@ -0,0 +1,442 @@
+#!/usr/bin/env python
+# Copyright (c) Microsoft Corporation.
+# SPDX-License-Identifier: Apache-2.0
+
+# DeepSpeed Team
+import argparse
+import os
+import math
+import sys
+import numpy as np
+import random
+
+import torch
+from torch.utils.data import DataLoader
+from torch.utils.data.distributed import DistributedSampler
+
+from transformers import (
+ SchedulerType,
+ get_scheduler,
+ AutoTokenizer
+)
+
+import deepspeed
+from transformers import AdamW
+sys.path.append(
+ os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir)))
+from utils.data import build_dataset, DataCollatorPadToMaxLen, split_dataset, shuffle_dataset
+from utils.utils import print_rank_0, to_device, save_hf_format, set_random_seed, get_all_reduce_mean, get_optimizer_grouped_parameters, save_zero_three_model
+from utils.ds_utils import get_train_ds_config
+from utils.module.lora import convert_linear_layer_to_lora, only_optimize_lora_parameters, fuse_lora, unfuse_lora
+from utils.model import create_dsvl_model_and_transforms
+
+def parse_args():
+ parser = argparse.ArgumentParser(
+ description=
+ "Finetune a transformers model on a multi-modal task")
+
+ parser.add_argument('--data_path',
+ type=str,
+ default='./data/',
+ help='Where the training data are stored.')
+
+ parser.add_argument('--data_debug_path',
+ type=str,
+ default=None,
+                        help='If provided, will save 10 training samples '
+                        'to the path for debugging purposes.')
+
+ parser.add_argument(
+ "--data_train_split_ratio",
+ type=float,
+ default=0.9,
+        help="Ratio of the dataset to be split as training data. The remainder becomes eval data.",
+ )
+ parser.add_argument('--dataset_names',
+ nargs='*',
+ default=['minigpt4'],
+                        help='Name of training dataset(s) to be used. Accepted format: '
+                        '1) a single dataset name, or 2) multiple dataset names in the '
+                        'form: dataset1 dataset2 ...')
+
+ parser.add_argument('--dataset_samples',
+ nargs='*',
+ default=['all'],
+                        help='How many samples to use from each dataset. '
+                        'Should be either an integer or the string "all", which '
+                        'means use all samples. For example, "all 512" means '
+                        'using all samples from the first dataset and 512 samples '
+                        'from the second dataset.')
+
+ parser.add_argument('--dataset_concatenate_samples',
+ nargs='*',
+ default=[1],
+                        help='How many samples to concatenate from each dataset. '
+                        'Should be either an integer or a string; 1 '
+                        'means using 1 sample for each datapoint.')
+
+ parser.add_argument(
+ "--max_num_image_per_sample",
+ type=int,
+ default=8,
+ help="The maximum number of images per sample.",
+ )
+ parser.add_argument(
+ "--per_device_train_batch_size",
+ type=int,
+ default=2,
+ help="Batch size (per device) for the training dataloader.",
+ )
+ parser.add_argument(
+ "--per_device_eval_batch_size",
+ type=int,
+ default=2,
+ help="Batch size (per device) for the evaluation dataloader.",
+ )
+ parser.add_argument(
+ "--max_seq_len",
+ type=int,
+ default=4096,
+ help="The maximum sequence length, note that image tokens are included.",
+ )
+ parser.add_argument(
+ "--learning_rate",
+ type=float,
+ default=1e-3,
+ help=
+ "Initial learning rate (after the potential warmup period) to use.",
+ )
+ parser.add_argument(
+ "--learning_rate_pretraining_components",
+ type=float,
+ default=0,
+ help=
+        "Initial learning rate for the pretrained components, e.g., the embedding layer (after the potential warmup period).",
+ )
+ parser.add_argument("--weight_decay",
+ type=float,
+ default=0.,
+ help="Weight decay to use.")
+ parser.add_argument("--num_train_epochs",
+ type=int,
+ default=6,
+ help="Total number of training epochs to perform.")
+ parser.add_argument(
+ "--gradient_accumulation_steps",
+ type=int,
+ default=1,
+ help=
+        "Number of update steps to accumulate before performing a backward/update pass.",
+ )
+ parser.add_argument(
+ "--lr_scheduler_type",
+ type=SchedulerType,
+ default="cosine",
+ help="The scheduler type to use.",
+ choices=[
+ "linear", "cosine", "cosine_with_restarts", "polynomial",
+ "constant", "constant_with_warmup"
+ ],
+ )
+ parser.add_argument(
+ "--num_warmup_steps",
+ type=float,
+ default=0,
+        help="Number of warmup steps (>1) or warmup ratio of total training steps (<=1) for the lr scheduler.")
+ parser.add_argument("--output_dir",
+ type=str,
+ default=None,
+ help="Where to store the model.")
+ parser.add_argument("--seed",
+ type=int,
+ default=1234,
+ help="A seed for reproducible training.")
+ parser.add_argument("--local_rank",
+ type=int,
+ default=-1,
+ help="local_rank for distributed training on gpus")
+ parser.add_argument('--gradient_checkpointing',
+ action='store_true',
+ help='Enable HF gradient checkpointing for model.')
+ parser.add_argument(
+ "--lm_model_name_or_path",
+ type=str,
+ help=
+ "Path to pretrained model or model identifier from huggingface.co/models.",
+ required=True,
+ )
+ parser.add_argument("--vision_model_name_or_path", default="openai/clip-vit-large-patch14", type=str)
+ parser.add_argument(
+ "--enable_mmca_attention",
+ action='store_true',
+        help="Enable the newly proposed multi-modal causal attention, which is similar to cross attention.",
+ )
+ parser.add_argument(
+ "--vis_proj",
+ type=str,
+ default='baseline',
+        help="[baseline, vit, or perceiver], used to project vision features to the LLM embedding space.",
+ )
+ # deepspeed features
+ parser.add_argument(
+ '--zero_stage',
+ type=int,
+ default=0,
+        help='ZeRO optimization stage for the model.')
+ parser.add_argument(
+ "--precision",
+ type=str,
+ choices=["fp16", "bf16"],
+ default="fp16",
+ help=
+        "FP16 or BF16 precision. FP16 is recommended for typical use cases; BF16 is good for large models.",
+ )
+ parser.add_argument('--enable_tensorboard',
+ action='store_true',
+ help='Enable tensorboard logging')
+ ## LoRA for efficient training setting
+ parser.add_argument("--lang_lora_dim",
+ type=int,
+ default=0,
+ help="Use LoRA for fine-tuning language decoder (> 0).")
+ parser.add_argument("--lang_lora_module_name",
+ type=str,
+ default="model.layers.",
+ help="The scope name of the target LoRA parameters.")
+ parser.add_argument("--vis_lora_dim",
+ type=int,
+ default=0,
+ help="Use LoRA for fine-tuning visual encoder (> 0).")
+ parser.add_argument("--vis_lora_module_name",
+ type=str,
+ default="encoder.layers.",
+ help="The scope name of the target LoRA parameters.")
+ parser.add_argument('--only_optimize_lora',
+ action='store_true',
+ help='Only optimize the LoRA parameters.')
+
+
+ parser = deepspeed.add_config_arguments(parser)
+ args = parser.parse_args()
+
+ if args.learning_rate_pretraining_components == 0.0:
+        # If no separate learning rate is provided for the pretrained components (mainly the embedding), reuse the main learning rate.
+ args.learning_rate_pretraining_components = args.learning_rate
+ assert args.num_warmup_steps >= 0, "--num_warmup_steps must be >= 0"
+ if 'qwen' in args.vision_model_name_or_path.lower():
+        assert args.vis_proj == 'baseline', "QWen-VL's visual encoder only supports the baseline vis_proj because it already contains a perceiver module"
+ return args
+
+
+def main():
+ args = parse_args()
+
+ if args.local_rank == -1:
+ device = torch.device("cuda")
+ else:
+ torch.cuda.set_device(args.local_rank)
+ device = torch.device("cuda", args.local_rank)
+        # Initializes the distributed backend, which will take care of synchronizing nodes/GPUs.
+ deepspeed.init_distributed()
+
+ args.global_rank = torch.distributed.get_rank()
+
+ ds_config = get_train_ds_config(args, offload=False,
+ stage=args.zero_stage)
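+    # DeepSpeed needs both the per-GPU micro batch size and the effective global batch size
+    # (micro batch size * world size * gradient accumulation steps) in its config.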
+ ds_config[
+ 'train_micro_batch_size_per_gpu'] = args.per_device_train_batch_size
+ ds_config[
+ 'train_batch_size'] = args.per_device_train_batch_size * torch.distributed.get_world_size(
+ ) * args.gradient_accumulation_steps
+
+ # If passed along, set the training seed now.
+ set_random_seed(args.seed)
+
+ torch.distributed.barrier()
+ tokenizer = AutoTokenizer.from_pretrained(args.lm_model_name_or_path,
+ fast_tokenizer=True)
+ tokenizer.padding_side = 'right'
+ model, image_processor, tokenizer = create_dsvl_model_and_transforms(
+ text_tokenizer=tokenizer,
+ args=args,
+ ds_config=ds_config)
+ if args.lang_lora_dim > 0:
+ model.lang_decoder = convert_linear_layer_to_lora(model.lang_decoder, args.lang_lora_module_name, args.lang_lora_dim)
+ if args.only_optimize_lora:
+ model.lang_decoder = only_optimize_lora_parameters(model.lang_decoder)
+
+ if args.vis_lora_dim > 0:
+ model.vis_encoder = convert_linear_layer_to_lora(model.vis_encoder, args.vis_lora_module_name, args.vis_lora_dim)
+ if args.only_optimize_lora:
+ model.vis_encoder = only_optimize_lora_parameters(model.vis_encoder)
+
+ print_rank_0(model, args.global_rank)
+
+ # Prepare the data
+ if len(args.dataset_samples) < len(args.dataset_names):
+ assert len(args.dataset_samples) == 1, "when args.dataset_samples is not the same length as args.dataset_names, it should be only one number"
+ args.dataset_samples = [args.dataset_samples[0]] * len(args.dataset_names)
+ if len(args.dataset_concatenate_samples) < len(args.dataset_names):
+ assert len(args.dataset_concatenate_samples) == 1, "when args.dataset_concatenate_samples is not the same length as args.dataset_names, it should be only one number"
+ args.dataset_concatenate_samples = [args.dataset_concatenate_samples[0]] * len(args.dataset_names)
+ # convert to int
+ args.dataset_concatenate_samples = [int(i) for i in args.dataset_concatenate_samples]
+
+ dataset = build_dataset(
+ args.data_path,
+ args.data_debug_path,
+ args.dataset_names,
+ args.dataset_samples,
+ args.dataset_concatenate_samples,
+ args.max_num_image_per_sample,
+ vis_processor=image_processor,
+ tokenizer=tokenizer,
+ )
+ # split the dataset into train and evaluation
+ total_data = len(dataset)
+ np_rng = np.random.RandomState(seed=args.seed)
+ dataset = shuffle_dataset(dataset, np_rng)
+ train_dataset, eval_dataset = split_dataset(dataset, args.data_train_split_ratio)
+
+ train_dataloader = DataLoader(
+ train_dataset,
+ batch_size=args.per_device_train_batch_size,
+ sampler=DistributedSampler(train_dataset, shuffle=True, drop_last=True),
+ collate_fn=DataCollatorPadToMaxLen(args.max_seq_len, tokenizer.pad_token_id),
+ )
+
+ eval_dataloader = DataLoader(
+ eval_dataset,
+ batch_size=args.per_device_eval_batch_size,
+ sampler=DistributedSampler(eval_dataset, shuffle=False),
+ collate_fn=DataCollatorPadToMaxLen(args.max_seq_len, tokenizer.pad_token_id),
+ )
+
+ # Split weights in two groups, one with weight decay and the other not.
+ optimizer_grouped_parameters = get_optimizer_grouped_parameters(
+ model, args.weight_decay, small_lr=args.learning_rate_pretraining_components)
+
+ optimizer = AdamW(optimizer_grouped_parameters,
+ lr=args.learning_rate,
+ betas=(0.9, 0.95))
+
+ num_update_steps_per_epoch = math.ceil(
+ len(train_dataloader) / args.gradient_accumulation_steps)
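+    # A --num_warmup_steps value <= 1 is interpreted as a fraction of the total number of training steps;
+    # a value > 1 is used directly as an absolute number of warmup steps.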
+ if args.num_warmup_steps <= 1:
+ args.num_warmup_steps = int(args.num_warmup_steps * args.num_train_epochs * num_update_steps_per_epoch)
+ else:
+ args.num_warmup_steps = int(args.num_warmup_steps)
+ lr_scheduler = get_scheduler(
+ name=args.lr_scheduler_type,
+ optimizer=optimizer,
+ num_warmup_steps=args.num_warmup_steps,
+ num_training_steps=args.num_train_epochs * num_update_steps_per_epoch,
+ )
+
+ model, optimizer, _, lr_scheduler = deepspeed.initialize(
+ model=model,
+ optimizer=optimizer,
+ args=args,
+ config=ds_config,
+ lr_scheduler=lr_scheduler,
+ dist_init_required=True)
+
+ start_epoch = 0
+ # let load checkpoint
+ if os.path.exists(os.path.join(args.output_dir, 'latest')):
+        # We have a DeepSpeed checkpoint, so this is a resumed job.
+ # TODO: after loading the ckpt, the global step is not loaded. Need to ask Tunji/Ammar for help.
+ _, client_state = model.load_checkpoint(args.output_dir)
+ start_epoch = client_state['epoch']
+ best_loss = client_state['best_loss']
+ random.setstate(client_state['random_rng_state'])
+ np.random.set_state(client_state['np_rng_state'])
+ torch.set_rng_state(client_state['torch_rng_state'])
+ torch.cuda.set_rng_state(client_state['torch_cuda_rng_state'])
+
+ if args.gradient_checkpointing:
+ model.gradient_checkpointing_enable()
+
+ def evaluation(model, eval_dataloader):
+ model.eval()
+ acc_loss = 0
+ for step, batch in enumerate(eval_dataloader):
+ with torch.no_grad():
+ batch = to_device(batch, device)
+ loss = model(
+                    batch["image"].half(),
+ batch["input_ids"],
+ attention_mask=batch["attention_mask"],
+ input_labels=batch["labels"],
+ image_num=batch["image_num"],
+ )[0]
+ acc_loss += loss
+ model.train()
+ acc_loss = get_all_reduce_mean(acc_loss).item()
+ ave_loss = acc_loss / (step + 1)
+ print_rank_0(f"the eval average_loss: {ave_loss}", args.global_rank)
+ return ave_loss
+
+ # Train!
+ if start_epoch == 0:
+ print_rank_0("***** Before training *****", args.global_rank)
+ evaluation(model, eval_dataloader)
+ best_loss = 1e6
+
+ print_rank_0("***** Running training *****", args.global_rank)
+ for epoch in range(start_epoch, args.num_train_epochs):
+ print_rank_0(
+ f"Beginning of Epoch {epoch+1}/{args.num_train_epochs}, Total Micro Batches {len(train_dataloader)}",
+ args.global_rank)
+ model.train()
+ acc_loss = 0
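+        # Each micro-batch: forward pass, accumulate the detached loss for logging, then engine backward() and step(); DeepSpeed handles gradient accumulation and ZeRO partitioning internally.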
+ for step, batch in enumerate(train_dataloader):
+ batch = to_device(batch, device) #torch.size(1, 3, 224, 224]) #torch.Size([1, 1, 3, 224, 224])
+ images = batch["image"].half()
+ input_ids = batch["input_ids"]
+ attention_mask = batch["attention_mask"]
+ labels = batch["labels"]
+ loss = model(
+ images,
+ input_ids,
+ attention_mask=attention_mask,
+ input_labels=labels,
+ image_num=batch["image_num"],
+ )[0]
+ acc_loss += loss.detach().clone()
+ model.backward(loss)
+ model.step()
+ model.tput_timer.update_epoch_count()
+ acc_loss = get_all_reduce_mean(acc_loss).item()
+        print_rank_0(f"Epoch {epoch+1}, the average training loss: {acc_loss/(step+1)}", args.global_rank)
+ eval_loss = evaluation(model, eval_dataloader)
+
+
+ if eval_loss < best_loss:
+ best_loss = eval_loss
+
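+        # Export an HF-format copy every epoch: temporarily fuse the LoRA weights into the base model, save (with a ZeRO-3-aware path when needed), then unfuse so LoRA training can continue.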
+ model = fuse_lora(model)
+ if args.global_rank == 0:
+ save_hf_format(model, tokenizer, args, f'epoch-{epoch}')
+ if args.zero_stage == 3:
+ # For zero stage 3, each gpu only has a part of the model, so we need a special save function
+ save_zero_three_model(model,
+ args.global_rank,
+ args.output_dir,
+ zero_stage=args.zero_stage,
+ sub_folder=f'epoch-{epoch}')
+ model = unfuse_lora(model)
+ # save deepspeed zero checkpoint so we can resume training if needed
+ client_state = {
+ 'random_rng_state': random.getstate(),
+ 'np_rng_state': np.random.get_state(),
+ 'torch_rng_state': torch.get_rng_state(),
+ 'torch_cuda_rng_state': torch.cuda.get_rng_state(),
+ 'epoch': epoch + 1, # start from next epoch
+ 'best_loss': best_loss,
+ }
+ model.save_checkpoint(args.output_dir, client_state=client_state) # save to the latest
+
+
+if __name__ == "__main__":
+ main()
\ No newline at end of file
diff --git a/applications/DeepSpeed-VisualChat/training/training_scripts/run_7b.sh b/applications/DeepSpeed-VisualChat/training/training_scripts/run_7b.sh
new file mode 100755
index 000000000..de7019536
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/training/training_scripts/run_7b.sh
@@ -0,0 +1,47 @@
+#!/bin/bash
+# Copyright (c) Microsoft Corporation.
+# SPDX-License-Identifier: Apache-2.0
+
+# DeepSpeed Team
+
+VISION_MODEL=openai/clip-vit-large-patch14
+LLM=meta-llama/Llama-2-7b
+
+
+
+EPOCH=6
+ZERO_STAGE=3
+lr=1e-3
+
+DATA_PATH=./data
+DATA="llava llava_dial otter_mimicit_cgd otter_mimicit_sd otter_mimicit_sn otter_mimicit_tvc otter_mimicit_vst llava_otter_blend sparkles_dialogue"
+DATA_SAMPLE="all"
+IMAGE_PER_SAMPLE="3 2 1 1 1 1 1 1 1"
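+# DATA_SAMPLE="all" uses every sample of each dataset; IMAGE_PER_SAMPLE gives, per dataset, how many raw samples are concatenated into one training sample (passed as --dataset_concatenate_samples).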
+
+DATA_CONCATE="${DATA// /_}"
+DATA_SAMPLE_CONCATE="${DATA_SAMPLE// /_}"
+IMAGE_CONCATE="${IMAGE_PER_SAMPLE// /_}"
+#
+
+OUTPUT_Base=./output/
+
+OUTPUT_Dir=Epoch${EPOCH}_LR${lr}_data_${DATA_CONCATE}_${DATA_SAMPLE_CONCATE}_${IMAGE_CONCATE}
+
+OUTPUT=${OUTPUT_Base}${OUTPUT_Dir}
+
+if [ "$ZERO_STAGE" == "" ]; then
+ ZERO_STAGE=0
+fi
+
+mkdir -p $OUTPUT
+mkdir -p ./log/$OUTPUT_Dir/
+
+# we assume a global batch size of 128, i.e., Num_GPU * per_device_train_batch_size * gradient_accumulation_steps = 128
+deepspeed main.py --max_seq_len 4096 \
+ --data_path ${DATA_PATH} \
+ --dataset_names ${DATA} --dataset_samples ${DATA_SAMPLE} --dataset_concatenate_samples ${IMAGE_PER_SAMPLE} --max_num_image_per_sample 8 \
+ --lm_model_name_or_path ${LLM} \
+ --vision_model_name_or_path ${VISION_MODEL} \
+ --gradient_checkpointing --vis_proj baseline \
+ --gradient_accumulation_steps 1 --zero_stage $ZERO_STAGE --learning_rate $lr --num_warmup_steps 0.1 \
+ --per_device_train_batch_size 1 --per_device_eval_batch_size 2 --deepspeed --output_dir $OUTPUT --num_train_epochs ${EPOCH} --enable_mmca_attention --enable_tensorboard
\ No newline at end of file
diff --git a/applications/DeepSpeed-VisualChat/utils/data/DST.py b/applications/DeepSpeed-VisualChat/utils/data/DST.py
new file mode 100644
index 000000000..cf5880f3f
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/data/DST.py
@@ -0,0 +1,139 @@
+from typing import Iterable
+import random
+import numpy as np
+## the following code is adapted from https://github.com/haotian-liu/LLaVA
+## the following code is adapted from https://github.com/open-mmlab/Multimodal-GPT
+## the following code is adapted from https://github.com/Luodian/Otter/
+
+# deepspeed template
+
+DEFAULT_SYSTEM_TOKEN="### System instuction:"
+DEFAULT_PROMPT = f"You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.\n\n"
+
+# The angle-bracket placeholders below (<image>, <question>, <answer>, <im_patch>, <im_start>, <im_end>)
+# follow the LLaVA/Otter-style convention for placeholder tokens.
+DEFAULT_IMAGE_TOKEN = "<image>"
+DEFAULT_HUMAN_TOKEN = "### Human:"
+DEFAULT_HUMAN_QUESTION_PRETOKEN = "### Question:"
+DEFAULT_QUESTION_TOKEN = "<question>"
+DEFAULT_HUMAN_IMAGE_PRETOKEN = "### Image:"
+
+DEFAULT_ASSISTANT_TOKEN = "### Answer:"
+DEFAULT_ANSWER_TOKEN = "<answer>"
+
+DEFAULT_ASSISTANT_END_ROUND_TOKEN=""
+DEFAULT_IMAGE_PATCH_TOKEN = "<im_patch>"
+DEFAULT_IM_START_TOKEN = "<im_start>"
+DEFAULT_IM_END_TOKEN = "<im_end>"
+
+IMAGE_NUM = ''
+IMAGE_NUM_1 = '### Image 1:'
+IMAGE_NUM_2 = '### Image 2:'
+IMAGE_NUM_3 = '### Image 3:'
+IMAGE_NUM_4 = '### Image 4:'
+IMAGE_NUM_5 = '### Image 5:'
+IMAGE_NUM_6 = '### Image 6:'
+IMAGE_NUM_7 = '### Image 7:'
+IMAGE_NUM_8 = '### Image 8:'
+
+# for now we support at most 8 images; this can be extended to more
+image_mapping_dict = {"default": DEFAULT_HUMAN_IMAGE_PRETOKEN, "1": IMAGE_NUM_1, "2": IMAGE_NUM_2, "3": IMAGE_NUM_3, "4": IMAGE_NUM_4, "5": IMAGE_NUM_5, "6": IMAGE_NUM_6, "7": IMAGE_NUM_7, "8": IMAGE_NUM_8}
+
+special_token_list = [DEFAULT_HUMAN_IMAGE_PRETOKEN, DEFAULT_IMAGE_TOKEN] # used for easy image # replacement
+
+DEFAULT_LABEL_PADDING_NUM = -100
+
+def add_special_token(tokenizer):
+ tokenizer.add_tokens(special_token_list, special_tokens=True)
+ if tokenizer.pad_token is None:
+ # Issue: GPT models don't have a pad token, which we use to
+ # modify labels for the loss.
+        tokenizer.add_special_tokens({"pad_token": "<pad>"})  # "<pad>" is a conventional placeholder choice; any unused string works
+ return tokenizer
+
+def get_image_num_map(tokenizer):
+ image_num_map = {}
+ for key in image_mapping_dict:
+        image_num_map[image_mapping_dict[key]] = tokenizer(image_mapping_dict[key])['input_ids'][1:] # drop the leading BOS token
+ image_num_map[DEFAULT_HUMAN_IMAGE_PRETOKEN] = image_num_map[DEFAULT_HUMAN_IMAGE_PRETOKEN][0] # convert list to number
+ return image_num_map
+
+TEMPLATE = {
+ "description": "Template Modified by DeepSpeed Team for Chat.",
+ "prompt_qa_with_image": f'''{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n\n{DEFAULT_HUMAN_QUESTION_PRETOKEN}\n{DEFAULT_QUESTION_TOKEN}\n\n{DEFAULT_ASSISTANT_TOKEN}\n''',
+ "prompt_qa_without_image": f'''{DEFAULT_HUMAN_QUESTION_PRETOKEN}\n{DEFAULT_QUESTION_TOKEN}\n\n{DEFAULT_ASSISTANT_TOKEN}\n''',
+}
+
+class Prompter:
+ def __call__(self, question, with_image=True, first_message=False, num_images=-1, options=None):
+ if options:
+ raise NotImplementedError("options not supported yet")
+ options = ", ".join(options)
+ res = TEMPLATE["prompt_choice"].format(image=DEFAULT_IMAGE_TOKEN, question=question, options=options)
+ else:
+ if with_image:
+ res = TEMPLATE["prompt_qa_with_image"].replace(DEFAULT_QUESTION_TOKEN, question)
+ if num_images >= 1:
+ tmp_dict = {
+ 1: f"{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n\n",
+ 2: f"{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n\n",
+ 3: f"{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n\n",
+ 4: f"{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n\n",
+ 5: f"{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n\n",
+ 6: f"{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n\n",
+ 7: f"{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n\n",
+ 8: f"{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n\n",
+ }
+ res = res.replace(f"{DEFAULT_HUMAN_IMAGE_PRETOKEN}\n{DEFAULT_IMAGE_TOKEN}\n\n", tmp_dict[num_images])
+ else:
+ res = TEMPLATE["prompt_qa_without_image"].replace(DEFAULT_QUESTION_TOKEN, question)
+
+ if first_message:
+ res = DEFAULT_PROMPT + res
+ return res
+
+ def get_response(self, output: str) -> str:
+ return output.split(TEMPLATE["response_split"])[-1].strip()
+
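+# Example (with the placeholder tokens defined above): Prompter()("What is in the image?", with_image=True, first_message=True)
+# returns the system prompt followed by "### Image:\n<image>\n\n### Question:\nWhat is in the image?\n\n### Answer:\n".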
+def _flatten(items):
+ """Yield items from any nested iterable; see Reference."""
+ for x in items:
+ if isinstance(x, Iterable) and not isinstance(x, (str, bytes)):
+ for sub_x in flatten(x):
+ yield sub_x
+ else:
+ yield x
+
+def flatten(items):
+ return list(_flatten(items))
+
+
+def split_list_with_random_num_items_up_to_a_certain_number(input_list, max_num):
+ if len(input_list) <= max_num:
+ return [input_list]
+ else:
+ random_num = random.randint(1, max_num)
+ return [input_list[:random_num]] + split_list_with_random_num_items_up_to_a_certain_number(input_list[random_num:], max_num)
+
+def random_grouping(input_list, max_num):
+ random.shuffle(input_list)
+ random_num = np.random.randint(1, max_num+1, len(input_list))
+    # binary-search cum_sum for the first index whose cumulative sum is >= len(input_list),
+    # then split input_list into groups of those random sizes
+    cum_sum = np.cumsum(random_num)
+    # binary search for that index
+ left = 0
+ right = len(cum_sum) - 1
+ while left < right:
+ mid = (left + right) // 2
+ if cum_sum[mid] >= len(input_list):
+ right = mid
+ else:
+ left = mid + 1
+ index = left
+ cum_sum = list(cum_sum[:index+1])
+ if cum_sum[-1] > len(input_list):
+ cum_sum[-1] = len(input_list)
+ elif cum_sum[-1] < len(input_list):
+ cum_sum.append(len(input_list))
+
+ return [input_list[cum_sum[i]:cum_sum[i+1]] for i in range(len(cum_sum)-1)]
+ # return split_list_with_random_num_items_up_to_a_certain_number(input_list, max_num)
\ No newline at end of file
diff --git a/applications/DeepSpeed-VisualChat/utils/data/__init__.py b/applications/DeepSpeed-VisualChat/utils/data/__init__.py
new file mode 100644
index 000000000..31de795ba
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/data/__init__.py
@@ -0,0 +1,6 @@
+# This file is adapted from https://github.com/open-mmlab/Multimodal-GPT
+
+from .builder import build_dataset # noqa: F401
+from .vqa_dataset import VQADataset # noqa: F401
+from .utils import DataCollatorPadToMaxLen, split_dataset, shuffle_dataset # noqa: F401
+from .DST import add_special_token
\ No newline at end of file
diff --git a/applications/DeepSpeed-VisualChat/utils/data/aokvqa_dataset.py b/applications/DeepSpeed-VisualChat/utils/data/aokvqa_dataset.py
new file mode 100644
index 000000000..bc204dba3
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/data/aokvqa_dataset.py
@@ -0,0 +1,59 @@
+# This file is adapted from https://github.com/open-mmlab/Multimodal-GPT
+# This dataset is from https://allenai.org/project/a-okvqa/home
+import os
+import random
+from PIL import Image
+
+from .vqa_dataset import VQADataset
+from utils.utils import get_rank
+from .utils import save_debug_image, save_debug_text
+
+REASON_QUESTIONS = [
+ "Why?",
+ "Why is this?",
+ "And why?",
+ "What is the reason?",
+ "And can you tell me why?",
+ "Can you tell me why?",
+ "Can you tell me the reason?",
+]
+
+
+class AOKVQADataset(VQADataset):
+ def __init__(self, data_path, data_debug_path, per_sample_image, tokenizer, vis_processor, **kwargs):
+ vis_root = f"{data_path}/coco/train2017"
+ assert os.path.isdir(vis_root), f"AOKVQADataset image directory {vis_root} not found, you need to download 2017 Train images from https://cocodataset.org/#download"
+ ann_paths = ["aokvqa/annotations/aokvqa_v1p0_train.json"]
+ for idx in range(len(ann_paths)):
+ ann_paths[idx] = f"{data_path}/{ann_paths[idx]}"
+ assert os.path.isfile(ann_paths[idx]), f"AOKVQADataset annotation file {ann_paths[idx]} not found, you need to download it from https://allenai.org/project/a-okvqa/home"
+ super().__init__(data_path, data_debug_path, per_sample_image, tokenizer, vis_processor,
+ vis_root, ann_paths, **kwargs)
+
+ def process_text(self, ann, data_debug_path=None, data_debug_counter=0, first_message=True):
+ question = ann["question"]
+ question = question + " " + random.choice(REASON_QUESTIONS)
+
+ choices = ann["choices"]
+ true_answer = choices[ann["correct_choice_idx"]]
+ answer = "The answer is " + true_answer + ". Because " + " ".join(ann["rationales"])
+
+ is_option = random.random() < self.option_prob and len(choices) > 1 # let's not do option for now
+ # if is_option:
+ # instruction = self.prompter(question, choices)
+ # else:
+ instruction = self.prompter(question, with_image=True, first_message=first_message)
+ save_debug_text([instruction, answer], data_debug_path, data_debug_counter, get_rank())
+ return dict(instruction=instruction, answer=answer)
+
+ def process_image(self, ann, data_debug_path=None, data_debug_counter=0):
+ image_path = os.path.join(self.vis_root, str(ann["image_id"]).rjust(12, '0') + ".jpg")
+ save_debug_image(image_path, data_debug_path, data_debug_counter, get_rank(), img_idx=0)
+ image = Image.open(image_path).convert("RGB")
+
+ image = self.vis_processor(image)
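+        # HF image processors return a dict containing 'pixel_values', while torchvision-style transforms return a tensor directly; the try/except below handles both.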
+ try:
+ image = image['pixel_values'][0]
+ return image
+ except:
+ return image
diff --git a/applications/DeepSpeed-VisualChat/utils/data/builder.py b/applications/DeepSpeed-VisualChat/utils/data/builder.py
new file mode 100644
index 000000000..237af28ab
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/data/builder.py
@@ -0,0 +1,140 @@
+# This file is adapted from https://github.com/open-mmlab/Multimodal-GPT
+
+import numpy as np
+import torch
+
+from .aokvqa_dataset import AOKVQADataset # noqa: F401
+from .cc_sbu_align_dataset import CcSbuAlignDataset # noqa: F401
+from .coco_caption_dataset import COCOCaptionDataset # noqa: F401
+from .dial_dataset import DialDataset # noqa: F401
+from .llava_dataset import LlavaDataset # noqa: F401
+from .llava_otter_blend_dataset import LlavaOtterBlendDataset # noqa: F401
+from .ocr_vqa_dataset import OCRVQADataset # noqa: F401
+from .otter_mimicit_cgd_dataset import OtterMimicitCgdDataset # noqa: F401
+from .otter_mimicit_sd_dataset import OtterMimicitSdDataset # noqa: F401
+from .otter_mimicit_sn_dataset import OtterMimicitSnDataset # noqa: F401
+from .otter_mimicit_tvc_dataset import OtterMimicitTvcDataset # noqa: F401
+from .otter_mimicit_vst_dataset import OtterMimicitVstDataset # noqa: F401
+from .sparkles_dialogue_dataset import SparklesDialogueDataset # noqa: F401
+from .vqa_dataset import ConcatDataset # noqa: F401
+from utils.utils import print_rank_0
+
+
+def build_dataset(data_path, data_debug_path, dataset_name, dataset_sample,
+ dataset_concatenate_samples, max_num_image_per_sample, **kwargs):
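+    # When dataset_name is a list, build each dataset recursively and concatenate them; otherwise dispatch on the single dataset name below.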
+ if isinstance(dataset_name, list):
+ datasets = [build_dataset(data_path, data_debug_path,
+ dataset_name[i], dataset_sample[i],
+ dataset_concatenate_samples[i],
+ max_num_image_per_sample,
+ **kwargs) for i in range(len(dataset_name))]
+ return ConcatDataset(datasets)
+ if dataset_name == "aokvqa":
+ dataset = AOKVQADataset(
+ data_path,
+ data_debug_path,
+ dataset_concatenate_samples,
+ **kwargs,
+ )
+ elif dataset_name == "coco_caption":
+ dataset = COCOCaptionDataset(
+ data_path,
+ data_debug_path,
+ dataset_concatenate_samples,
+ **kwargs,
+ )
+ elif dataset_name == "llava":
+ dataset = LlavaDataset(
+ data_path,
+ data_debug_path,
+ dataset_concatenate_samples,
+ **kwargs,
+ )
+ elif dataset_name == "llava_dial":
+ dataset = DialDataset(
+ dataset_name,
+ data_path,
+ data_debug_path,
+ dataset_concatenate_samples,
+ **kwargs,
+ )
+ elif dataset_name == "llava_otter_blend":
+ dataset = LlavaOtterBlendDataset(
+ data_path,
+ data_debug_path,
+ dataset_concatenate_samples,
+ followup=False,
+ **kwargs,
+ )
+ elif dataset_name == "minigpt4":
+ dataset = CcSbuAlignDataset(
+ data_path,
+ data_debug_path,
+ dataset_concatenate_samples,
+ **kwargs,
+ )
+ elif dataset_name == "ocr_vqa":
+ dataset = OCRVQADataset(
+ data_path,
+ data_debug_path,
+ dataset_concatenate_samples,
+ **kwargs,
+ )
+ elif dataset_name == "otter_mimicit_cgd":
+ dataset = OtterMimicitCgdDataset(
+ data_path,
+ data_debug_path,
+ dataset_concatenate_samples,
+ **kwargs,
+ )
+ elif dataset_name == "otter_mimicit_sd":
+ dataset = OtterMimicitSdDataset(
+ data_path,
+ data_debug_path,
+ dataset_concatenate_samples,
+ **kwargs,
+ )
+ elif dataset_name == "otter_mimicit_sn":
+ dataset = OtterMimicitSnDataset(
+ data_path,
+ data_debug_path,
+ dataset_concatenate_samples,
+ max_num_image_per_sample,
+ **kwargs,
+ )
+ elif dataset_name == "otter_mimicit_tvc":
+ dataset = OtterMimicitTvcDataset(
+ data_path,
+ data_debug_path,
+ dataset_concatenate_samples,
+ max_num_image_per_sample,
+ **kwargs,
+ )
+ elif dataset_name == "otter_mimicit_vst":
+ dataset = OtterMimicitVstDataset(
+ data_path,
+ data_debug_path,
+ dataset_concatenate_samples,
+ max_num_image_per_sample,
+ **kwargs,
+ )
+ elif dataset_name == "sparkles_dialogue":
+ dataset = SparklesDialogueDataset(
+ data_path,
+ data_debug_path,
+ dataset_concatenate_samples,
+ **kwargs,
+ )
+ else:
+ raise NotImplementedError
+
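+    # dataset_sample is either 'all' (keep every sample) or an integer: randomly subsample that many examples without replacement.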
+ if dataset_sample != 'all':
+ dataset_sample = int(dataset_sample)
+ random_indices = np.random.choice(len(dataset), min(dataset_sample, len(dataset)), replace=False)
+ subsample_dataset = torch.utils.data.Subset(dataset, random_indices)
+ subsample_dataset.collater = dataset.collater
+ print_rank_0(f"[DATA] Built dataset {dataset_name} with {len(subsample_dataset)} samples.")
+ return subsample_dataset
+ else:
+ print_rank_0(f"[DATA] Built dataset {dataset_name} with all {len(dataset)} samples.")
+ return dataset
diff --git a/applications/DeepSpeed-VisualChat/utils/data/cc_sbu_align_dataset.py b/applications/DeepSpeed-VisualChat/utils/data/cc_sbu_align_dataset.py
new file mode 100644
index 000000000..843b0c355
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/data/cc_sbu_align_dataset.py
@@ -0,0 +1,103 @@
+# This file is adapted from https://github.com/open-mmlab/Multimodal-GPT
+# This dataset is from https://minigpt-4.github.io/
+
+import os
+import random
+from PIL import Image
+
+from .vqa_dataset import VQADataset
+import utils.data.DST as DST
+from utils.utils import get_rank
+from .utils import save_debug_image, save_debug_text
+
+
+QUESTIONS = [
+ "please describe the image",
+ "can you describe the image",
+ "Could you provide a description of the image?",
+ "What do you see in this image?",
+ "Share your thoughts on the content of the image.",
+ "Please narrate what's happening in the picture.",
+ "Can you give a brief explanation of the image?",
+ "Describe the main elements and details present in the image.",
+ "In your own words, what is depicted in the image?",
+ "Can you outline the key aspects of the image?",
+ "What are the most striking features in this image?",
+ "Please provide a summary of the image's content.",
+ "Describe the overall theme or concept captured in the image.",
+ "How would you explain the image's composition and focus?",
+ "What is the focal point or main subject of the image?",
+ "How do the different components of the image interact with each other?",
+ "What would be a fitting caption for this image?",
+ "Can you create a concise description that captures the essence of the image?",
+ "How would you briefly summarize the content of this image in a phrase or sentence?",
+ "Please provide a catchy and relevant caption for this picture.",
+ "If you were to give this image a title, what would it be?",
+ "Describe the image in one creative sentence.",
+ "Please suggest a memorable phrase that encapsulates the image's content.",
+ "What engaging phrase would best represent this image?",
+ "Can you create an expressive caption that highlights the main theme of the image?",
+ "How would you sum up the image's story for a caption?",
+ "Provide an eye-catching caption that conveys the image's core message.",
+ "If you were to give this image a headline, what would it say?",
+ "Can you craft a captivating caption that communicates the essence of the image?",
+ "How would you describe the image's content in a powerful caption?",
+ "Please provide an inventive title to summarize the scene depicted in the image.",
+ "Compose a concise and striking phrase that reflects the image's key elements.",
+ "If you were to create a caption for this image, what would it be?",
+ "Offer a compelling caption that highlights the central focus of the image.",
+ "Can you produce a unique caption that encapsulates the image's overall mood?",
+ "Please generate an attention-grabbing caption that would best illustrate the events captured in this image",
+ "How would you express the image's main idea in an impactful sentence?",
+ "Please create a vivid and concise title that conveys the essence of the picture.",
+ "Compose an imaginative caption that reflects the image's most striking features.",
+ "What memorable statement would best represent the scene illustrated in this image?",
+ "Draft an evocative caption that brings the image to life for the reader.",
+ "Can you suggest an insightful caption that highlights the underlying message of the image?",
+ "What engaging phrase would effectively convey the action or subject matter depicted in this picture?",
+ "How would you encapsulate the image's core theme in a concise and expressive manner?",
+ "Please provide a creative and impactful title that captures the spirit of the image.",
+ "Craft a captivating caption that showcases the image's most prominent attributes.",
+ "What intriguing statement would best sum up the scene presented in this image?",
+ "Develop a descriptive caption that paints a vivid picture for the viewer.",
+ "Can you give a detailed account of the image's contents?",
+ "What are the key elements and features visible in this image?",
+ "How would you narrate the events or actions depicted in the picture?",
+ "Please share your observations about the various components present in the image.",
+ "What is the overall theme or concept captured in this image? Can you describe it?",
+]
+
+
+class CcSbuAlignDataset(VQADataset):
+ def __init__(self, data_path, data_debug_path, per_sample_image, tokenizer, vis_processor, add_eos=True, ignore_instruction=True, **kwargs):
+ vis_root = f"{data_path}/cc_sbu_align/image"
+ assert os.path.isdir(vis_root), f"CcSbuAlignDataset image directory {vis_root} not found, you need to download it from https://huggingface.co/datasets/Vision-CAIR/cc_sbu_align"
+
+ ann_paths = ["cc_sbu_align/filter_cap.json"]
+ real_ann_paths = []
+ for ann_path in ann_paths:
+ ann_path = f"{data_path}/{ann_path}"
+ real_ann_paths.append(ann_path)
+ assert os.path.isfile(ann_path), f"CcSbuAlignDataset annotation file {ann_path} not found, you need to download it from https://huggingface.co/datasets/Vision-CAIR/cc_sbu_align"
+ super().__init__(data_path, data_debug_path, per_sample_image, tokenizer, vis_processor,
+ vis_root, real_ann_paths, annotation_key="annotations", **kwargs)
+
+ def process_text(self, ann, data_debug_path=None, data_debug_counter=0, first_message=True):
+ # random select a question
+ question = random.choice(QUESTIONS)
+ answer = ann["caption"]
+ instruction = self.prompter(question, with_image=True, first_message=first_message)
+ save_debug_text([instruction, answer], data_debug_path, data_debug_counter, get_rank())
+ return dict(instruction=instruction, answer=answer)
+
+ def process_image(self, ann, data_debug_path=None, data_debug_counter=0):
+ image_path = os.path.join(self.vis_root, ann["image_id"] + ".jpg")
+ save_debug_image(image_path, data_debug_path, data_debug_counter, get_rank(), img_idx=0)
+ image = Image.open(image_path).convert("RGB")
+
+ image = self.vis_processor(image)
+ try:
+ image = image['pixel_values'][0]
+ return image
+ except:
+ return image
diff --git a/applications/DeepSpeed-VisualChat/utils/data/coco_caption_dataset.py b/applications/DeepSpeed-VisualChat/utils/data/coco_caption_dataset.py
new file mode 100644
index 000000000..9dce9bca8
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/data/coco_caption_dataset.py
@@ -0,0 +1,115 @@
+# This file is adapted from https://github.com/open-mmlab/Multimodal-GPT
+# This dataset is from https://cs.stanford.edu/people/karpathy/deepimagesent/
+
+"""
+ Copyright (c) 2022, salesforce.com, inc.
+ All rights reserved.
+ SPDX-License-Identifier: BSD-3-Clause
+ For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+"""
+
+import os
+import random
+from PIL import Image
+
+from .vqa_dataset import VQADataset
+from utils.utils import get_rank
+from .utils import save_debug_image, save_debug_text
+
+QUESTIONS = [
+ "please describe the image",
+ "can you describe the image",
+ "Could you provide a description of the image?",
+ "What do you see in this image?",
+ "Share your thoughts on the content of the image.",
+ "Please narrate what's happening in the picture.",
+ "Can you give a brief explanation of the image?",
+ "Describe the main elements and details present in the image.",
+ "In your own words, what is depicted in the image?",
+ "Can you outline the key aspects of the image?",
+ "What are the most striking features in this image?",
+ "Please provide a summary of the image's content.",
+ "Describe the overall theme or concept captured in the image.",
+ "How would you explain the image's composition and focus?",
+ "What is the focal point or main subject of the image?",
+ "How do the different components of the image interact with each other?",
+ "What would be a fitting caption for this image?",
+ "Can you create a concise description that captures the essence of the image?",
+ "How would you briefly summarize the content of this image in a phrase or sentence?",
+ "Please provide a catchy and relevant caption for this picture.",
+ "If you were to give this image a title, what would it be?",
+ "Describe the image in one creative sentence.",
+ "Please suggest a memorable phrase that encapsulates the image's content.",
+ "What engaging phrase would best represent this image?",
+ "Can you create an expressive caption that highlights the main theme of the image?",
+ "How would you sum up the image's story for a caption?",
+ "Provide an eye-catching caption that conveys the image's core message.",
+ "If you were to give this image a headline, what would it say?",
+ "Can you craft a captivating caption that communicates the essence of the image?",
+ "How would you describe the image's content in a powerful caption?",
+ "Please provide an inventive title to summarize the scene depicted in the image.",
+ "Compose a concise and striking phrase that reflects the image's key elements.",
+ "If you were to create a caption for this image, what would it be?",
+ "Offer a compelling caption that highlights the central focus of the image.",
+ "Can you produce a unique caption that encapsulates the image's overall mood?",
+ "Please generate an attention-grabbing caption that would best illustrate the events captured in this image",
+ "How would you express the image's main idea in an impactful sentence?",
+ "Please create a vivid and concise title that conveys the essence of the picture.",
+ "Compose an imaginative caption that reflects the image's most striking features.",
+ "What memorable statement would best represent the scene illustrated in this image?",
+ "Draft an evocative caption that brings the image to life for the reader.",
+ "Can you suggest an insightful caption that highlights the underlying message of the image?",
+ "What engaging phrase would effectively convey the action or subject matter depicted in this picture?",
+ "How would you encapsulate the image's core theme in a concise and expressive manner?",
+ "Please provide a creative and impactful title that captures the spirit of the image.",
+ "Craft a captivating caption that showcases the image's most prominent attributes.",
+ "What intriguing statement would best sum up the scene presented in this image?",
+ "Develop a descriptive caption that paints a vivid picture for the viewer.",
+ "Can you give a detailed account of the image's contents?",
+ "What are the key elements and features visible in this image?",
+ "How would you narrate the events or actions depicted in the picture?",
+ "Please share your observations about the various components present in the image.",
+ "What is the overall theme or concept captured in this image? Can you describe it?",
+]
+
+
+class COCOCaptionDataset(VQADataset):
+ def __init__(
+ self, data_path, data_debug_path, per_sample_image, tokenizer, vis_processor=None, add_eos=True, ignore_instruction=True, **kwargs
+ ):
+ """
+ vis_root (string): Root directory of images (e.g. coco/images/)
+ ann_root (string): directory to store the annotation file
+ """
+ self.vis_root = f"{data_path}/coco/2014"
+ assert os.path.isdir(self.vis_root), f"COCOCaptionDataset image directory {self.vis_root} not found, you need to download 2014 Train images and 2014 Val images from https://cocodataset.org/#download"
+ ann_paths = ["coco_caption/dataset.json"]
+ real_ann_paths = []
+ for ann_path in ann_paths:
+ ann_path = f"{data_path}/{ann_path}"
+ real_ann_paths.append(ann_path)
+ assert os.path.isfile(ann_path), f"COCOCaptionDataset annotation file {ann_path} not found, you need to download it from https://cs.stanford.edu/people/karpathy/deepimagesent/coco.zip"
+ super().__init__(data_path, data_debug_path, per_sample_image, tokenizer, vis_processor,
+ self.vis_root, real_ann_paths, annotation_key="images", **kwargs)
+
+ def process_image(self, ann, data_debug_path=None, data_debug_counter=0):
+ image_path = os.path.join(self.vis_root, ann["filename"])
+ save_debug_image(image_path, data_debug_path, data_debug_counter, get_rank(), img_idx=0)
+ image = Image.open(image_path).convert("RGB")
+
+ image = self.vis_processor(image)
+ try:
+ image = image['pixel_values'][0]
+ return image
+ except:
+ return image
+
+ def process_text(self, ann, data_debug_path=None, data_debug_counter=0, first_message=True):
+ all_captions = ann["sentences"]
+ if not isinstance(all_captions, list):
+ all_captions = [all_captions]
+ caption = random.choice(all_captions)
+ caption = caption['raw']
+ instruction = self.prompter(random.choice(QUESTIONS), with_image=True, first_message=first_message)
+ save_debug_text([instruction, caption], data_debug_path, data_debug_counter, get_rank())
+ return dict(instruction=instruction, answer=caption)
diff --git a/applications/DeepSpeed-VisualChat/utils/data/dial_dataset.py b/applications/DeepSpeed-VisualChat/utils/data/dial_dataset.py
new file mode 100644
index 000000000..63b99b5ae
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/data/dial_dataset.py
@@ -0,0 +1,78 @@
+# This file is adapted from https://github.com/open-mmlab/Multimodal-GPT
+# This dataset is from https://llava-vl.github.io/
+import os
+from .vqa_dataset import VQADataset
+import utils.data.DST as DST
+from utils.utils import get_rank
+from .utils import save_debug_text
+
+class DialDataset(VQADataset):
+ def __init__(self, dataset_name, data_path, data_debug_path, per_sample_image, tokenizer, vis_processor, **kwargs):
+ if dataset_name == "llava_dial":
+ vis_root = f"{data_path}/coco/train2017"
+ assert os.path.isdir(vis_root), f"llava_dial image directory {vis_root} not found, you need to download 2017 Train images from https://cocodataset.org/#download"
+ ann_paths = ["llava/conversation_58k.json"]
+ for idx in range(len(ann_paths)):
+ ann_paths[idx] = f"{data_path}/{ann_paths[idx]}"
+ assert os.path.isfile(ann_paths[idx]), f"llava_dial annotation file {ann_paths[idx]} not found, you need to download it from https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K"
+ super(DialDataset, self).__init__(data_path, data_debug_path, per_sample_image,
+ tokenizer, vis_processor, vis_root,
+ ann_paths, **kwargs)
+ self.prompter = DST.Prompter()
+
+ def _add_instance_ids(self, key="id"):
+ for idx, ann in enumerate(self.annotation):
+ ann[key] = str(idx)
+
+ def process_text(self, anns, data_debug_path=None, data_debug_counter=0, first_message=False):
+ num_convs = len(anns["conversations"]) // 2
+ conv_list = []
+ for conv_id in range(num_convs):
+ question = anns["conversations"][int(2*conv_id)]["value"]
+            # remove the '<image>' tag and '\n'
+            with_image = "<image>" in question
+            question = question.replace("<image>", "").replace("\n", "")
+ answer = anns["conversations"][int(2*conv_id+1)]["value"]
+ instruction = self.prompter(question, with_image=with_image, first_message=(conv_id == 0 and first_message))
+ single_conv = dict(instruction=instruction, answer=answer)
+ conv_list.append(single_conv)
+ save_debug_text(conv_list, data_debug_path, data_debug_counter, get_rank())
+ return conv_list
+
+ def __getitem__(self, index):
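+        # For each annotation in the group, concatenate all of its conversation rounds into one token sequence; merge_all_images then combines the per-image results.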
+ full_res_list = []
+ for ann in self.annotation[index]:
+ image = self.process_image(ann,
+ data_debug_path=self.data_debug_path,
+ data_debug_counter=self.data_debug_counter)
+ text_list = self.process_text(ann,
+ data_debug_path=self.data_debug_path,
+ data_debug_counter=self.data_debug_counter,
+ first_message=(not full_res_list))
+ self.data_debug_counter += 1
+ res_list = []
+ for text in text_list:
+ single_res = self.tokenize(text)
+ single_res["instruction"] = text["instruction"]
+ single_res["answer"] = text["answer"]
+ res_list.append(single_res)
+ input_ids = []
+ attention_mask = []
+ labels = []
+ instruction = ''
+ answer = ''
+ for res in res_list:
+ input_ids.extend(res["input_ids"])
+ attention_mask.extend(res["attention_mask"])
+ labels.extend(res["labels"])
+ instruction += res["instruction"]
+ answer += res["answer"]
+
+ res = dict(
+ input_ids=input_ids, attention_mask=attention_mask, labels=labels, instruction=instruction, answer=answer
+ )
+ res.update(image=image)
+
+ full_res_list.append(res)
+ output = self.merge_all_images(full_res_list)
+ return output
diff --git a/applications/DeepSpeed-VisualChat/utils/data/llava_dataset.py b/applications/DeepSpeed-VisualChat/utils/data/llava_dataset.py
new file mode 100644
index 000000000..601ecbc4b
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/data/llava_dataset.py
@@ -0,0 +1,31 @@
+# This file is adapted from https://github.com/open-mmlab/Multimodal-GPT
+# This dataset is from https://llava-vl.github.io/
+import os
+from .vqa_dataset import VQADataset
+from utils.utils import get_rank
+from .utils import save_debug_text
+
+
+class LlavaDataset(VQADataset):
+ def __init__(self, data_path, data_debug_path, per_sample_image, tokenizer, vis_processor, **kwargs):
+ vis_root = f"{data_path}/coco/train2017"
+ assert os.path.isdir(vis_root), f"LlavaDataset image directory {vis_root} not found, you need to download 2017 Train images from https://cocodataset.org/#download"
+ ann_paths = ["llava/detail_23k.json", "llava/complex_reasoning_77k.json"]
+ for idx in range(len(ann_paths)):
+ ann_paths[idx] = f"{data_path}/{ann_paths[idx]}"
+ assert os.path.isfile(ann_paths[idx]), f"LlavaDataset annotation file {ann_paths[idx]} not found, you need to download it from https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K"
+ super().__init__(data_path, data_debug_path, per_sample_image, tokenizer, vis_processor,
+ vis_root, ann_paths, **kwargs)
+
+ def _add_instance_ids(self, key="id"):
+ for idx, ann in enumerate(self.annotation):
+ ann[key] = str(idx)
+
+ def process_text(self, ann, data_debug_path=None, data_debug_counter=0, first_message=False):
+ question = ann["conversations"][0]["value"]
+        # remove the '<image>' tag and '\n'
+        question = question.replace("<image>", "").replace("\n", "")
+ answer = ann["conversations"][1]["value"]
+ instruction = self.prompter(question, with_image=True, first_message=first_message)
+ save_debug_text([instruction, answer], data_debug_path, data_debug_counter, get_rank())
+ return dict(instruction=instruction, answer=answer)
diff --git a/applications/DeepSpeed-VisualChat/utils/data/llava_otter_blend_dataset.py b/applications/DeepSpeed-VisualChat/utils/data/llava_otter_blend_dataset.py
new file mode 100644
index 000000000..a35962280
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/data/llava_otter_blend_dataset.py
@@ -0,0 +1,207 @@
+# This dataset is from https://llava-vl.github.io/ and https://huggingface.co/datasets/pufanyi/MIMICIT
+# This dataset blends llava, llava_dial, and otter_mimicit_cgd datasets, which is possible because
+# all of them use coco images. Each sample of LlavaOtterBlendDataset first contains at least one
+# instruction-answer pair from llava/llava_dial, followed by at least one instruction-answer pair
+# from otter_mimicit_cgd.
+import os
+import torch
+import json
+import random
+from tqdm import tqdm
+from PIL import Image
+from .vqa_dataset import VQADataset
+from utils.utils import print_rank_0, is_rank_0, get_rank
+from .utils import save_debug_image, save_debug_text
+
+
+class LlavaOtterBlendDataset(VQADataset):
+ def __init__(self, data_path, data_debug_path, per_sample_image, followup, tokenizer, vis_processor, **kwargs):
+ vis_root = f"{data_path}/coco/train2017"
+ assert os.path.isdir(vis_root), f"LlavaOtterBlendDataset image directory {vis_root} not found, you need to download 2017 Train images from https://cocodataset.org/#download"
+
+ otter_mimicit_cgd = f"{data_path}/MIMIC-IT/CGD_instructions.json"
+ llava = [f"{data_path}/llava/detail_23k.json", f"{data_path}/llava/complex_reasoning_77k.json", f"{data_path}/llava/conversation_58k.json"]
+ ann_path_otter = f"{data_path}/LlavaOtterBlendDataset_instructions_otter.json"
+ ann_path_llava = f"{data_path}/LlavaOtterBlendDataset_instructions_llava.json"
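+        # One-time preprocessing (rank 0 only): index the llava conversations by image filename so they can later be matched with the MIMIC-IT CGD image pairs.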
+ if not os.path.isfile(ann_path_llava):
+            print_rank_0(f"LlavaOtterBlendDataset llava annotation file {ann_path_llava} not found, starting a one-time preprocessing:")
+ if is_rank_0():
+ annotations_llava = {}
+ for llava_ann in llava:
+ assert os.path.isfile(llava_ann), f"LlavaOtterBlendDataset raw annotation file {llava_ann} not found, you need to download it from https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K"
+ raw_annotation = json.load(open(llava_ann, "r"))
+ for raw_ann in raw_annotation:
+ if raw_ann["image"] not in annotations_llava:
+ annotations_llava[raw_ann["image"]] = []
+ annotations_llava[raw_ann["image"]].append(raw_ann["conversations"])
+ with open(ann_path_llava, 'w') as f:
+ json.dump(annotations_llava, f)
+ torch.distributed.barrier()
+ self.ann_llava = json.load(open(ann_path_llava, "r"))
+ if not os.path.isfile(ann_path_otter):
+            print_rank_0(f"LlavaOtterBlendDataset otter annotation file {ann_path_otter} not found, starting a one-time preprocessing:")
+ if is_rank_0():
+ assert os.path.isfile(otter_mimicit_cgd), f"LlavaOtterBlendDataset raw annotation file {otter_mimicit_cgd} not found, you need to download it from https://huggingface.co/datasets/pufanyi/MIMICIT"
+ raw_annotation = json.load(open(otter_mimicit_cgd, "r"))["data"]
+ raw_annotation_keys = list(raw_annotation.keys())
+ annotations_otter = []
+ for k in tqdm(raw_annotation_keys):
+ if k in raw_annotation:
+ ann = {}
+ ann["image_ids"] = [self.convert_image_id(x) for x in raw_annotation[k]["image_ids"]]
+ meet_criteria = True
+ for midx in range(len(ann["image_ids"])-1):
+ if ann["image_ids"][midx] not in self.ann_llava:
+ meet_criteria = False
+                        if meet_criteria: # If any image (except the last) has no llava conversation, we cannot build a valid sample with the correct image order
+ ann["instruction"] = [raw_annotation[k]["instruction"]]
+ ann["answer"] = [raw_annotation[k]["answer"]]
+ rel_ins_ids = raw_annotation[k]["rel_ins_ids"]
+ for k_rel in rel_ins_ids:
+ if k_rel in raw_annotation:
+ ann["instruction"].append(raw_annotation[k_rel]["instruction"])
+ ann["answer"].append(raw_annotation[k_rel]["answer"])
+ del raw_annotation[k_rel]
+ annotations_otter.append(ann)
+ del raw_annotation[k]
+ with open(ann_path_otter, 'w') as f:
+ json.dump(annotations_otter, f)
+ torch.distributed.barrier()
+ super().__init__(data_path, data_debug_path, per_sample_image, tokenizer, vis_processor,
+ vis_root, [ann_path_otter], **kwargs)
+ self.followup = followup
+
+ def _add_instance_ids(self, key="id"):
+ for idx, ann in enumerate(self.annotation):
+ ann[key] = str(idx)
+
+ def convert_image_id(self, image_id):
+ return image_id[8:] + ".jpg"
+
+ def process_image(self, ann, data_debug_path=None, data_debug_counter=0):
+ images = ann["image_ids"]
+ output_images = []
+ for idx in range(len(images)):
+ image = images[idx]
+ image_path = os.path.join(self.vis_root, image)
+ save_debug_image(image_path, data_debug_path, data_debug_counter, get_rank(), img_idx=idx)
+ image = Image.open(image_path).convert("RGB")
+
+ image = self.vis_processor(image)
+ try:
+ image = image['pixel_values'][0]
+ except:
+ image = image
+ output_images.append(image)
+
+ return output_images
+
+ def process_text(self, ann, data_debug_path=None, data_debug_counter=0, first_message=False, num_images=1):
+ images = ann["image_ids"]
+ processed_images = {}
+ conv_list = []
+ # At least one conversation from llava
+ for idx in range(len(images)):
+ img_key = images[idx]
+ if img_key in self.ann_llava:
+ conversations = self.ann_llava[img_key]
+                min_num_draw = 1 if idx < (len(images) - 1) else 0 # The last image may have 0 llava conversations since that won't break the image order
+ num_draw = random.randint(min_num_draw, len(conversations))
+ chosen = random.sample(list(range(len(conversations))), num_draw)
+ for cid in chosen:
+ conv = conversations[cid]
+ num_convs = len(conv) // 2
+ for conv_id in range(num_convs):
+ question = conv[int(2*conv_id)]["value"]
+                        # remove the '<image>' tag and '\n'
+                        with_image = img_key not in processed_images
+                        question = question.replace("<image>", "").replace("\n", "")
+ answer = conv[int(2*conv_id+1)]["value"]
+ instruction = self.prompter(question, with_image=with_image, first_message=(len(conv_list) == 0 and first_message))
+ if with_image:
+ instruction = self.post_process_text_image_count(instruction, 1, offset=len(processed_images))
+ single_conv = dict(instruction=instruction, answer=answer)
+ conv_list.append(single_conv)
+ processed_images[img_key] = 1
+
+ # At least one conversation from otter
+ question_list = ann["instruction"]
+ answer_list = ann["answer"]
+ num_convs = len(question_list)
+ num_draw = random.randint(1, num_convs)
+ chosen = random.sample(list(range(num_convs)), num_draw)
+ for cid in chosen:
+ question = question_list[cid]
+            # remove the '<image>' tag and '\n'
+            question = question.replace("<image>", "").replace("\n", "")
+ answer = answer_list[cid]
+ num_images = len(images) - len(processed_images)
+ instruction = self.prompter(question, with_image=(num_images > 0),
+ first_message=(len(conv_list) == 0),
+ num_images=num_images)
+ if num_images > 0:
+ instruction = self.post_process_text_image_count(instruction, num_images, offset=len(processed_images))
+ single_conv = dict(instruction=instruction, answer=answer)
+ conv_list.append(single_conv)
+ processed_images = images
+ # Follow-up llava conversations
+ if self.followup:
+ image_tags = {0: ["In image 1, ", "In image a, ", "In the first image, "], 1: ["In image 2, ", "In image b, ", "In the second image, "]}
+ for idx in range(len(images)):
+ img_key = images[idx]
+ if img_key in self.ann_llava:
+ conversations = self.ann_llava[img_key]
+ # min_num_draw = 1
+ # num_draw = random.randint(min_num_draw, len(conversations))
+                    num_draw = 1 # To avoid making the conversation too complex, we limit follow-up conversations to 1 per image
+ chosen = random.sample(list(range(len(conversations))), num_draw)
+ for cid in chosen:
+ conv = conversations[cid]
+ num_convs = len(conv) // 2
+ for conv_id in range(num_convs):
+ question = conv[int(2*conv_id)]["value"]
+                            # remove the '<image>' tag and '\n'
+                            question = question.replace("<image>", "").replace("\n", "")
+ answer = conv[int(2*conv_id+1)]["value"]
+                            # Add image tags so the model knows which image we are referring to
+ chosen_tag = random.choice(image_tags[idx])
+ question = chosen_tag + question[0].lower() + question[1:]
+ answer = chosen_tag + answer[0].lower() + answer[1:]
+ instruction = self.prompter(question, with_image=False, first_message=False)
+ single_conv = dict(instruction=instruction, answer=answer)
+ conv_list.append(single_conv)
+ save_debug_text(conv_list, data_debug_path, data_debug_counter, get_rank())
+ return conv_list
+
+ def __getitem__(self, index):
+ ann = self.annotation[index][0] # self.annotation[index] is a list because of "self.annotation = DST.random_grouping(self.annotation, self.per_sample_image)" in VQADataset init
+ images_list = self.process_image(ann,
+ data_debug_path=self.data_debug_path,
+ data_debug_counter=self.data_debug_counter)
+ text_list = self.process_text(ann,
+ data_debug_path=self.data_debug_path,
+ data_debug_counter=self.data_debug_counter,
+ first_message=True,
+ num_images=len(images_list))
+
+ self.data_debug_counter += 1
+ res_list = []
+ for text in text_list:
+ single_res = self.tokenize(text)
+ res_list.append(single_res)
+
+ input_ids = []
+ attention_mask = []
+ labels = []
+ for res in res_list:
+ input_ids.extend(res["input_ids"])
+ attention_mask.extend(res["attention_mask"])
+ labels.extend(res["labels"])
+
+ res = dict(
+ input_ids=input_ids, attention_mask=attention_mask, labels=labels
+ )
+ res.update(image=images_list)
+ res.update(image_num=len(images_list))
+
+ return res
diff --git a/applications/DeepSpeed-VisualChat/utils/data/ocr_vqa_dataset.py b/applications/DeepSpeed-VisualChat/utils/data/ocr_vqa_dataset.py
new file mode 100644
index 000000000..0e57fbb8e
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/data/ocr_vqa_dataset.py
@@ -0,0 +1,68 @@
+# This file is adapted from https://github.com/open-mmlab/Multimodal-GPT
+# This dataset is from https://ocr-vqa.github.io/
+import json
+import os
+import random
+import torch
+
+from PIL import Image
+from tqdm import tqdm
+
+from .vqa_dataset import VQADataset
+from utils.utils import print_rank_0, is_rank_0, get_rank
+from .utils import save_debug_image, save_debug_text
+
+
+class OCRVQADataset(VQADataset):
+ def __init__(self, data_path, data_debug_path, per_sample_image, tokenizer, vis_processor,
+ add_eos=True, ignore_instruction=True, **kwargs):
+ self.vis_root = f"{data_path}/OCR_VQA/images"
+ assert os.path.isdir(self.vis_root), f"OCRVQADataset image directory {self.vis_root} not found, you need to download images from https://ocr-vqa.github.io/"
+ ann_paths_raw = ["OCR_VQA/dataset.json"]
+ ann_paths = ["OCR_VQA/dataset_processed.json"]
+ real_ann_paths = []
+ for idx in range(len(ann_paths_raw)):
+ ann_path_raw = f"{data_path}/{ann_paths_raw[idx]}"
+ assert os.path.isfile(ann_path_raw), f"OCRVQADataset raw annotation file {ann_path_raw} not found, you need to download it from https://ocr-vqa.github.io/"
+ ann_path = f"{data_path}/{ann_paths[idx]}"
+ real_ann_paths.append(ann_path)
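+            # One-time preprocessing: keep only records whose image decodes to something larger than 1x1 (guards against broken or placeholder downloads), then cache the filtered annotations.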
+ if not os.path.isfile(ann_path):
+                print_rank_0(f"OCRVQADataset processed annotation file {ann_path} not found, starting a one-time preprocessing:")
+ raw_annotation = json.load(open(ann_path_raw, "r"))
+ raw_annotation_keys = list(raw_annotation.keys())
+ for k in tqdm(raw_annotation_keys):
+ ext=os.path.splitext(raw_annotation[k]['imageURL'])[1]
+ outputFile = '%s%s'%(k,ext)
+ image_path = os.path.join(self.vis_root, outputFile)
+ image = Image.open(image_path).convert("RGB")
+ if image.size[0] > 1 and image.size[1] > 1:
+ raw_annotation[k]["filename"] = outputFile
+ else:
+ del raw_annotation[k]
+ if is_rank_0():
+ with open(ann_path, 'w') as f:
+ json.dump(list(raw_annotation.values()), f)
+ torch.distributed.barrier()
+ super().__init__(data_path, data_debug_path, per_sample_image, tokenizer, vis_processor,
+ self.vis_root, real_ann_paths, **kwargs)
+
+ def process_image(self, ann, data_debug_path=None, data_debug_counter=0):
+ image_path = os.path.join(self.vis_root, ann["filename"])
+ save_debug_image(image_path, data_debug_path, data_debug_counter, get_rank(), img_idx=0)
+ image = Image.open(image_path).convert("RGB")
+
+ image = self.vis_processor(image)
+ try:
+ image = image['pixel_values'][0]
+ return image
+ except:
+ return image
+
+ def process_text(self, ann, data_debug_path=None, data_debug_counter=0, first_message=True):
+ index = random.choice(list(range(len(ann["questions"]))))
+ question = ann["questions"][index]
+ answer = ann["answers"][index]
+
+ instruction = self.prompter(question, with_image=True, first_message=first_message)
+ save_debug_text([instruction, answer], data_debug_path, data_debug_counter, get_rank())
+ return dict(instruction=instruction, answer=answer)
diff --git a/applications/DeepSpeed-VisualChat/utils/data/otter_mimicit_cgd_dataset.py b/applications/DeepSpeed-VisualChat/utils/data/otter_mimicit_cgd_dataset.py
new file mode 100644
index 000000000..53d45551e
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/data/otter_mimicit_cgd_dataset.py
@@ -0,0 +1,145 @@
+# This dataset is from https://huggingface.co/datasets/pufanyi/MIMICIT
+import os
+import torch
+import json
+import random
+from tqdm import tqdm
+from PIL import Image
+from .vqa_dataset import VQADataset
+from utils.utils import print_rank_0, is_rank_0, get_rank
+from .utils import save_debug_image, save_debug_text
+
+
+class OtterMimicitCgdDataset(VQADataset):
+ def __init__(self, data_path, data_debug_path, per_sample_image, tokenizer, vis_processor, **kwargs):
+ vis_root = f"{data_path}/coco/train2017"
+ assert os.path.isdir(vis_root), f"OtterMimicitCgdDataset image directory {vis_root} not found, you need to download 2017 Train images from https://cocodataset.org/#download"
+        ### The commented code below loads the images shipped with MIMIC-IT. We use the original coco images above, which are the same images at higher resolution.
+ # vis_root = f"{data_path}/MIMIC-IT/CGD_images"
+ # if not os.path.isdir(vis_root):
+ # print_rank_0(f"OtterMimicitCgdDataset image directory {vis_root} not found, starting an one-time preprocessing:")
+ # vis_root_file = f"{data_path}/MIMIC-IT/CGD.json"
+ # assert os.path.isfile(vis_root_file), f"OtterMimicitCgdDataset image data {vis_root_file} not found, you need to download it from https://huggingface.co/datasets/pufanyi/MIMICIT"
+ # if is_rank_0():
+ # os.makedirs(vis_root, exist_ok=True)
+ # image_data = json.load(open(vis_root_file, "r"))
+ # image_keys = list(image_data.keys())
+ # for k in tqdm(image_keys):
+ # image = base64.b64decode(image_data[k])
+ # with open(f"{vis_root}/{k}.jpg", 'wb') as f:
+ # f.write(image)
+ # torch.distributed.barrier()
+
+ ann_paths_raw = ["MIMIC-IT/CGD_instructions.json"]
+ ann_paths = ["MIMIC-IT/CGD_instructions_merged.json"]
+ for idx in range(len(ann_paths)):
+ ann_paths_raw[idx] = f"{data_path}/{ann_paths_raw[idx]}"
+ ann_paths[idx] = f"{data_path}/{ann_paths[idx]}"
+ assert os.path.isfile(ann_paths_raw[idx]), f"OtterMimicitCgdDataset raw annotation file {ann_paths_raw[idx]} not found, you need to download it from https://huggingface.co/datasets/pufanyi/MIMICIT"
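+            # One-time preprocessing (rank 0 only): merge each instruction with its related instructions (rel_ins_ids) into a single multi-turn annotation and cache the merged file.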
+ if not os.path.isfile(ann_paths[idx]):
+                print_rank_0(f"OtterMimicitCgdDataset annotation file {ann_paths[idx]} not found, starting a one-time preprocessing:")
+ if is_rank_0():
+ raw_annotation = json.load(open(ann_paths_raw[idx], "r"))["data"]
+ raw_annotation_keys = list(raw_annotation.keys())
+ random.shuffle(raw_annotation_keys)
+ annotations = []
+ for k in tqdm(raw_annotation_keys):
+ if k in raw_annotation:
+ ann = {}
+ ann["image_ids"] = raw_annotation[k]["image_ids"]
+ ann["instruction"] = [raw_annotation[k]["instruction"]]
+ ann["answer"] = [raw_annotation[k]["answer"]]
+ rel_ins_ids = raw_annotation[k]["rel_ins_ids"]
+ for k_rel in rel_ins_ids:
+ if k_rel in raw_annotation:
+ ann["instruction"].append(raw_annotation[k_rel]["instruction"])
+ ann["answer"].append(raw_annotation[k_rel]["answer"])
+ del raw_annotation[k_rel]
+ annotations.append(ann)
+ del raw_annotation[k]
+ with open(ann_paths[idx], 'w') as f:
+ json.dump(annotations, f)
+ torch.distributed.barrier()
+ super().__init__(data_path, data_debug_path, per_sample_image, tokenizer, vis_processor,
+ vis_root, ann_paths, **kwargs)
+
+ def _add_instance_ids(self, key="id"):
+ for idx, ann in enumerate(self.annotation):
+ ann[key] = str(idx)
+
+ def convert_image_id(self, image_id):
+ return image_id[8:] + ".jpg"
+ # return image_id + ".jpg" ### Change to this if you switch to use images from MIMIC-IT/CGD_images
+
+ def process_image(self, ann, data_debug_path=None, data_debug_counter=0):
+ images = ann["image_ids"]
+ output_images = []
+ for idx in range(len(images)):
+ image = images[idx]
+ image_path = os.path.join(self.vis_root, self.convert_image_id(image))
+ save_debug_image(image_path, data_debug_path, data_debug_counter, get_rank(), img_idx=idx)
+ image = Image.open(image_path).convert("RGB")
+
+ image = self.vis_processor(image)
+ try:
+ image = image['pixel_values'][0]
+ except:
+ image = image
+ output_images.append(image)
+
+ return output_images
+
+ def process_text(self, ann, data_debug_path=None, data_debug_counter=0, first_message=False, num_images=1):
+ question_list = ann["instruction"]
+ answer_list = ann["answer"]
+ num_convs = len(question_list)
+ indexes = list(range(num_convs))
+ random.shuffle(indexes)
+ conv_list = []
+ for conv_id in range(num_convs):
+ question = question_list[indexes[conv_id]]
+            # remove the '<image>' tag and '\n'
+            question = question.replace("<image>", "").replace("\n", "")
+ answer = answer_list[indexes[conv_id]]
+ instruction = self.prompter(question, with_image=(conv_id == 0 and first_message),
+ first_message=(conv_id == 0 and first_message),
+ num_images=num_images)
+ if conv_id == 0 and first_message:
+ instruction = self.post_process_text_image_count(instruction, num_images)
+ single_conv = dict(instruction=instruction, answer=answer)
+ conv_list.append(single_conv)
+ save_debug_text(conv_list, data_debug_path, data_debug_counter, get_rank())
+ return conv_list
+
+ def __getitem__(self, index):
+ ann = self.annotation[index][0] # self.annotation[index] is a list because of "self.annotation = DST.random_grouping(self.annotation, self.per_sample_image)" in VQADataset init
+ images_list = self.process_image(ann,
+ data_debug_path=self.data_debug_path,
+ data_debug_counter=self.data_debug_counter)
+ text_list = self.process_text(ann,
+ data_debug_path=self.data_debug_path,
+ data_debug_counter=self.data_debug_counter,
+ first_message=True,
+ num_images=len(images_list))
+
+ self.data_debug_counter += 1
+ res_list = []
+ for text in text_list:
+ single_res = self.tokenize(text)
+ res_list.append(single_res)
+
+ input_ids = []
+ attention_mask = []
+ labels = []
+ for res in res_list:
+ input_ids.extend(res["input_ids"])
+ attention_mask.extend(res["attention_mask"])
+ labels.extend(res["labels"])
+
+ res = dict(
+ input_ids=input_ids, attention_mask=attention_mask, labels=labels
+ )
+ res.update(image=images_list)
+ res.update(image_num=len(images_list))
+
+ return res
diff --git a/applications/DeepSpeed-VisualChat/utils/data/otter_mimicit_sd_dataset.py b/applications/DeepSpeed-VisualChat/utils/data/otter_mimicit_sd_dataset.py
new file mode 100644
index 000000000..4bd7740e4
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/data/otter_mimicit_sd_dataset.py
@@ -0,0 +1,134 @@
+# This dataset is from https://huggingface.co/datasets/pufanyi/MIMICIT
+import os
+import torch
+import json
+import base64
+import random
+from tqdm import tqdm
+from PIL import Image
+from io import BytesIO
+from .vqa_dataset import VQADataset
+from utils.utils import print_rank_0, is_rank_0, get_rank
+from .utils import save_debug_image, save_debug_text
+
+
+class OtterMimicitSdDataset(VQADataset):
+ def __init__(self, data_path, data_debug_path, per_sample_image, tokenizer, vis_processor, **kwargs):
+ vis_root = f"{data_path}/MIMIC-IT/SD.json"
+ assert os.path.isfile(vis_root), f"OtterMimicitSdDataset image data {vis_root} not found, you need to download it from https://huggingface.co/datasets/pufanyi/MIMICIT"
+ self.vis_root_dict = json.load(open(vis_root, "r"))
+
+ ann_paths_raw = ["MIMIC-IT/SD_instructions.json"]
+ ann_paths = ["MIMIC-IT/SD_instructions_merged.json"]
+ for idx in range(len(ann_paths)):
+ ann_paths_raw[idx] = f"{data_path}/{ann_paths_raw[idx]}"
+ ann_paths[idx] = f"{data_path}/{ann_paths[idx]}"
+ assert os.path.isfile(ann_paths_raw[idx]), f"OtterMimicitSdDataset raw annotation file {ann_paths_raw[idx]} not found, you need to download it from https://huggingface.co/datasets/pufanyi/MIMICIT"
+ if not os.path.isfile(ann_paths[idx]):
+                print_rank_0(f"OtterMimicitSdDataset annotation file {ann_paths[idx]} not found, starting a one-time preprocessing:")
+ if is_rank_0():
+ raw_annotation = json.load(open(ann_paths_raw[idx], "r"))["data"]
+ raw_annotation_keys = list(raw_annotation.keys())
+ random.shuffle(raw_annotation_keys)
+ annotations = []
+ for k in tqdm(raw_annotation_keys):
+ if k in raw_annotation:
+ ann = {}
+ ann["image_ids"] = []
+ for image in raw_annotation[k]["image_ids"]:
+ if image in self.vis_root_dict:
+ ann["image_ids"].append(image)
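+                            # Keep the sample only if at least one of its images is present in SD.json.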
+ if len(ann["image_ids"]) > 0:
+ ann["instruction"] = [raw_annotation[k]["instruction"]]
+ ann["answer"] = [raw_annotation[k]["answer"]]
+ rel_ins_ids = raw_annotation[k]["rel_ins_ids"]
+ for k_rel in rel_ins_ids:
+ if k_rel in raw_annotation:
+ ann["instruction"].append(raw_annotation[k_rel]["instruction"])
+ ann["answer"].append(raw_annotation[k_rel]["answer"])
+ del raw_annotation[k_rel]
+ annotations.append(ann)
+ del raw_annotation[k]
+ with open(ann_paths[idx], 'w') as f:
+ json.dump(annotations, f)
+ torch.distributed.barrier()
+ super().__init__(data_path, data_debug_path, per_sample_image, tokenizer, vis_processor,
+ vis_root, ann_paths, **kwargs)
+
+ def _add_instance_ids(self, key="id"):
+ for idx, ann in enumerate(self.annotation):
+ ann[key] = str(idx)
+
+ def process_image(self, ann, data_debug_path=None, data_debug_counter=0):
+ images = ann["image_ids"]
+ output_images = []
+ for idx in range(len(images)):
+ image = images[idx]
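+            # SD.json stores images as base64 strings keyed by image id; decode to raw bytes before opening with PIL.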
+ image_base64 = base64.b64decode(self.vis_root_dict[image])
+ save_debug_image(image_base64, data_debug_path, data_debug_counter,
+ get_rank(), img_idx=idx, base64=True)
+ image = Image.open(BytesIO(image_base64)).convert("RGB")
+
+ image = self.vis_processor(image)
+ try:
+ image = image['pixel_values'][0]
+ except:
+ image = image
+ output_images.append(image)
+
+ return output_images
+
+ def process_text(self, ann, data_debug_path=None, data_debug_counter=0, first_message=False, num_images=1):
+ question_list = ann["instruction"]
+ answer_list = ann["answer"]
+ num_convs = len(question_list)
+ indexes = list(range(num_convs))
+ random.shuffle(indexes)
+ conv_list = []
+ for conv_id in range(num_convs):
+ question = question_list[indexes[conv_id]]
+            # remove '<image>' tag and '\n'
+            question = question.replace("<image>", "").replace("\n", "")
+ answer = answer_list[indexes[conv_id]]
+ instruction = self.prompter(question, with_image=(conv_id == 0 and first_message),
+ first_message=(conv_id == 0 and first_message),
+ num_images=num_images)
+ if conv_id == 0 and first_message:
+ instruction = self.post_process_text_image_count(instruction, num_images)
+ single_conv = dict(instruction=instruction, answer=answer)
+ conv_list.append(single_conv)
+ save_debug_text(conv_list, data_debug_path, data_debug_counter, get_rank())
+ return conv_list
+
+ def __getitem__(self, index):
+ ann = self.annotation[index][0] # self.annotation[index] is a list because of "self.annotation = DST.random_grouping(self.annotation, self.per_sample_image)" in VQADataset init
+ images_list = self.process_image(ann,
+ data_debug_path=self.data_debug_path,
+ data_debug_counter=self.data_debug_counter)
+ text_list = self.process_text(ann,
+ data_debug_path=self.data_debug_path,
+ data_debug_counter=self.data_debug_counter,
+ first_message=True,
+ num_images=len(images_list))
+
+ self.data_debug_counter += 1
+ res_list = []
+ for text in text_list:
+ single_res = self.tokenize(text)
+ res_list.append(single_res)
+
+ input_ids = []
+ attention_mask = []
+ labels = []
+ for res in res_list:
+ input_ids.extend(res["input_ids"])
+ attention_mask.extend(res["attention_mask"])
+ labels.extend(res["labels"])
+
+ res = dict(
+ input_ids=input_ids, attention_mask=attention_mask, labels=labels
+ )
+ res.update(image=images_list)
+ res.update(image_num=len(images_list))
+
+ return res
diff --git a/applications/DeepSpeed-VisualChat/utils/data/otter_mimicit_sn_dataset.py b/applications/DeepSpeed-VisualChat/utils/data/otter_mimicit_sn_dataset.py
new file mode 100644
index 000000000..be447119c
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/data/otter_mimicit_sn_dataset.py
@@ -0,0 +1,138 @@
+# This dataset is from https://huggingface.co/datasets/pufanyi/MIMICIT
+import os
+import torch
+import json
+import base64
+import random
+from tqdm import tqdm
+from PIL import Image
+from io import BytesIO
+from .vqa_dataset import VQADataset
+from utils.utils import print_rank_0, is_rank_0, get_rank
+from .utils import save_debug_image, save_debug_text
+
+
+class OtterMimicitSnDataset(VQADataset):
+ def __init__(self, data_path, data_debug_path, per_sample_image, max_num_image_per_sample, tokenizer, vis_processor, **kwargs):
+ vis_root = f"{data_path}/MIMIC-IT/SN.json"
+ assert os.path.isfile(vis_root), f"OtterMimicitSnDataset image data {vis_root} not found, you need to download it from https://huggingface.co/datasets/pufanyi/MIMICIT"
+ self.vis_root_dict = json.load(open(vis_root, "r"))
+ self.max_num_image_per_sample = max_num_image_per_sample
+
+ ann_paths_raw = ["MIMIC-IT/SN_instructions.json"]
+ ann_paths = [f"MIMIC-IT/SN_instructions_merged_filtered{max_num_image_per_sample}.json"]
+ for idx in range(len(ann_paths)):
+ ann_paths_raw[idx] = f"{data_path}/{ann_paths_raw[idx]}"
+ ann_paths[idx] = f"{data_path}/{ann_paths[idx]}"
+ assert os.path.isfile(ann_paths_raw[idx]), f"OtterMimicitSnDataset raw annotation file {ann_paths_raw[idx]} not found, you need to download it from https://huggingface.co/datasets/pufanyi/MIMICIT"
+ if not os.path.isfile(ann_paths[idx]):
+                print_rank_0(f"OtterMimicitSnDataset annotation file {ann_paths[idx]} not found, starting a one-time preprocessing:")
+ if is_rank_0():
+ raw_annotation = json.load(open(ann_paths_raw[idx], "r"))["data"]
+ raw_annotation_keys = list(raw_annotation.keys())
+ random.shuffle(raw_annotation_keys)
+ annotations = []
+ for k in tqdm(raw_annotation_keys):
+ if k in raw_annotation:
+ ann = {}
+ ann["image_ids"] = []
+ for image in raw_annotation[k]["image_ids"]:
+ if image in self.vis_root_dict:
+ ann["image_ids"].append(image)
+ if len(ann["image_ids"]) > 0 and len(ann["image_ids"]) <= max_num_image_per_sample:
+ ann["instruction"] = [raw_annotation[k]["instruction"]]
+ ann["answer"] = [raw_annotation[k]["answer"]]
+ rel_ins_ids = raw_annotation[k]["rel_ins_ids"]
+ for k_rel in rel_ins_ids:
+ if k_rel in raw_annotation:
+ ann["instruction"].append(raw_annotation[k_rel]["instruction"])
+ ann["answer"].append(raw_annotation[k_rel]["answer"])
+ del raw_annotation[k_rel]
+ annotations.append(ann)
+ del raw_annotation[k]
+ with open(ann_paths[idx], 'w') as f:
+ json.dump(annotations, f)
+ torch.distributed.barrier()
+ super().__init__(data_path, data_debug_path, per_sample_image, tokenizer, vis_processor,
+ vis_root, ann_paths, **kwargs)
+
+ def _add_instance_ids(self, key="id"):
+ for idx, ann in enumerate(self.annotation):
+ ann[key] = str(idx)
+
+ def process_image(self, ann, data_debug_path=None, data_debug_counter=0):
+ images = ann["image_ids"]
+ chosen = list(range(len(images)))
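+        # If the sample has more images than max_num_image_per_sample, randomly subsample them while preserving their original order.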
+ if len(images) > self.max_num_image_per_sample:
+ chosen = list(sorted(random.sample(chosen, self.max_num_image_per_sample)))
+ output_images = []
+ for idx in chosen:
+ image = images[idx]
+ image_base64 = base64.b64decode(self.vis_root_dict[image])
+ save_debug_image(image_base64, data_debug_path, data_debug_counter,
+ get_rank(), img_idx=idx, base64=True)
+ image = Image.open(BytesIO(image_base64)).convert("RGB")
+
+ image = self.vis_processor(image)
+ try:
+ image = image['pixel_values'][0]
+ except:
+ image = image
+ output_images.append(image)
+
+ return output_images
+
+ def process_text(self, ann, data_debug_path=None, data_debug_counter=0, first_message=False, num_images=1):
+ question_list = ann["instruction"]
+ answer_list = ann["answer"]
+ num_convs = len(question_list)
+ indexes = list(range(num_convs))
+ random.shuffle(indexes)
+ conv_list = []
+ for conv_id in range(num_convs):
+ question = question_list[indexes[conv_id]]
+            # remove '<image>' tag and '\n'
+            question = question.replace("<image>", "").replace("\n", "")
+ answer = answer_list[indexes[conv_id]]
+ instruction = self.prompter(question, with_image=(conv_id == 0 and first_message),
+ first_message=(conv_id == 0 and first_message),
+ num_images=num_images)
+ if conv_id == 0 and first_message:
+ instruction = self.post_process_text_image_count(instruction, num_images)
+ single_conv = dict(instruction=instruction, answer=answer)
+ conv_list.append(single_conv)
+ save_debug_text(conv_list, data_debug_path, data_debug_counter, get_rank())
+ return conv_list
+
+ def __getitem__(self, index):
+ ann = self.annotation[index][0] # self.annotation[index] is a list because of "self.annotation = DST.random_grouping(self.annotation, self.per_sample_image)" in VQADataset init
+ images_list = self.process_image(ann,
+ data_debug_path=self.data_debug_path,
+ data_debug_counter=self.data_debug_counter)
+ text_list = self.process_text(ann,
+ data_debug_path=self.data_debug_path,
+ data_debug_counter=self.data_debug_counter,
+ first_message=True,
+ num_images=len(images_list))
+
+ self.data_debug_counter += 1
+ res_list = []
+ for text in text_list:
+ single_res = self.tokenize(text)
+ res_list.append(single_res)
+
+ input_ids = []
+ attention_mask = []
+ labels = []
+ for res in res_list:
+ input_ids.extend(res["input_ids"])
+ attention_mask.extend(res["attention_mask"])
+ labels.extend(res["labels"])
+
+ res = dict(
+ input_ids=input_ids, attention_mask=attention_mask, labels=labels
+ )
+ res.update(image=images_list)
+ res.update(image_num=len(images_list))
+
+ return res
diff --git a/applications/DeepSpeed-VisualChat/utils/data/otter_mimicit_tvc_dataset.py b/applications/DeepSpeed-VisualChat/utils/data/otter_mimicit_tvc_dataset.py
new file mode 100644
index 000000000..09d1c5b88
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/data/otter_mimicit_tvc_dataset.py
@@ -0,0 +1,138 @@
+# This dataset is from https://huggingface.co/datasets/pufanyi/MIMICIT
+import os
+import torch
+import json
+import base64
+import random
+from tqdm import tqdm
+from PIL import Image
+from io import BytesIO
+from .vqa_dataset import VQADataset
+from utils.utils import print_rank_0, is_rank_0, get_rank
+from .utils import save_debug_image, save_debug_text
+
+
+class OtterMimicitTvcDataset(VQADataset):
+ def __init__(self, data_path, data_debug_path, per_sample_image, max_num_image_per_sample, tokenizer, vis_processor, **kwargs):
+ vis_root = f"{data_path}/MIMIC-IT/TVC.json"
+ assert os.path.isfile(vis_root), f"OtterMimicitTvcDataset image data {vis_root} not found, you need to download it from https://huggingface.co/datasets/pufanyi/MIMICIT"
+ self.vis_root_dict = json.load(open(vis_root, "r"))
+ self.max_num_image_per_sample = max_num_image_per_sample
+
+ ann_paths_raw = ["MIMIC-IT/TVC_instructions.json"]
+ ann_paths = [f"MIMIC-IT/TVC_instructions_merged_filtered{max_num_image_per_sample}.json"]
+ for idx in range(len(ann_paths)):
+ ann_paths_raw[idx] = f"{data_path}/{ann_paths_raw[idx]}"
+ ann_paths[idx] = f"{data_path}/{ann_paths[idx]}"
+ assert os.path.isfile(ann_paths_raw[idx]), f"OtterMimicitTvcDataset raw annotation file {ann_paths_raw[idx]} not found, you need to download it from https://huggingface.co/datasets/pufanyi/MIMICIT"
+ if not os.path.isfile(ann_paths[idx]):
+                print_rank_0(f"OtterMimicitTvcDataset annotation file {ann_paths[idx]} not found, starting a one-time preprocessing:")
+ if is_rank_0():
+ raw_annotation = json.load(open(ann_paths_raw[idx], "r"))["data"]
+ raw_annotation_keys = list(raw_annotation.keys())
+ random.shuffle(raw_annotation_keys)
+ annotations = []
+ for k in tqdm(raw_annotation_keys):
+ if k in raw_annotation:
+ ann = {}
+ ann["image_ids"] = []
+ for image in raw_annotation[k]["image_ids"]:
+ if image in self.vis_root_dict:
+ ann["image_ids"].append(image)
+ if len(ann["image_ids"]) > 0 and len(ann["image_ids"]) <= max_num_image_per_sample:
+ ann["instruction"] = [raw_annotation[k]["instruction"]]
+ ann["answer"] = [raw_annotation[k]["answer"]]
+ rel_ins_ids = raw_annotation[k]["rel_ins_ids"]
+ for k_rel in rel_ins_ids:
+ if k_rel in raw_annotation:
+ ann["instruction"].append(raw_annotation[k_rel]["instruction"])
+ ann["answer"].append(raw_annotation[k_rel]["answer"])
+ del raw_annotation[k_rel]
+ annotations.append(ann)
+ del raw_annotation[k]
+ with open(ann_paths[idx], 'w') as f:
+ json.dump(annotations, f)
+ torch.distributed.barrier()
+ super().__init__(data_path, data_debug_path, per_sample_image, tokenizer, vis_processor,
+ vis_root, ann_paths, **kwargs)
+
+ def _add_instance_ids(self, key="id"):
+ for idx, ann in enumerate(self.annotation):
+ ann[key] = str(idx)
+
+ def process_image(self, ann, data_debug_path=None, data_debug_counter=0):
+ images = ann["image_ids"]
+ chosen = list(range(len(images)))
+ if len(images) > self.max_num_image_per_sample:
+ chosen = list(sorted(random.sample(chosen, self.max_num_image_per_sample)))
+ output_images = []
+ for idx in chosen:
+ image = images[idx]
+ image_base64 = base64.b64decode(self.vis_root_dict[image])
+ save_debug_image(image_base64, data_debug_path, data_debug_counter,
+ get_rank(), img_idx=idx, base64=True)
+ image = Image.open(BytesIO(image_base64)).convert("RGB")
+
+ image = self.vis_processor(image)
+ try:
+ image = image['pixel_values'][0]
+ except:
+ image = image
+ output_images.append(image)
+
+ return output_images
+
+ def process_text(self, ann, data_debug_path=None, data_debug_counter=0, first_message=False, num_images=1):
+ question_list = ann["instruction"]
+ answer_list = ann["answer"]
+ num_convs = len(question_list)
+ indexes = list(range(num_convs))
+ random.shuffle(indexes)
+ conv_list = []
+ for conv_id in range(num_convs):
+ question = question_list[indexes[conv_id]]
+            # remove '<image>' tag and '\n'
+            question = question.replace("<image>", "").replace("\n", "")
+ answer = answer_list[indexes[conv_id]]
+ instruction = self.prompter(question, with_image=(conv_id == 0 and first_message),
+ first_message=(conv_id == 0 and first_message),
+ num_images=num_images)
+ if conv_id == 0 and first_message:
+ instruction = self.post_process_text_image_count(instruction, num_images)
+ single_conv = dict(instruction=instruction, answer=answer)
+ conv_list.append(single_conv)
+ save_debug_text(conv_list, data_debug_path, data_debug_counter, get_rank())
+ return conv_list
+
+ def __getitem__(self, index):
+ ann = self.annotation[index][0] # self.annotation[index] is a list because of "self.annotation = DST.random_grouping(self.annotation, self.per_sample_image)" in VQADataset init
+ images_list = self.process_image(ann,
+ data_debug_path=self.data_debug_path,
+ data_debug_counter=self.data_debug_counter)
+ text_list = self.process_text(ann,
+ data_debug_path=self.data_debug_path,
+ data_debug_counter=self.data_debug_counter,
+ first_message=True,
+ num_images=len(images_list))
+
+ self.data_debug_counter += 1
+ res_list = []
+ for text in text_list:
+ single_res = self.tokenize(text)
+ res_list.append(single_res)
+
+ input_ids = []
+ attention_mask = []
+ labels = []
+ for res in res_list:
+ input_ids.extend(res["input_ids"])
+ attention_mask.extend(res["attention_mask"])
+ labels.extend(res["labels"])
+
+ res = dict(
+ input_ids=input_ids, attention_mask=attention_mask, labels=labels
+ )
+ res.update(image=images_list)
+ res.update(image_num=len(images_list))
+
+ return res
diff --git a/applications/DeepSpeed-VisualChat/utils/data/otter_mimicit_vst_dataset.py b/applications/DeepSpeed-VisualChat/utils/data/otter_mimicit_vst_dataset.py
new file mode 100644
index 000000000..435c15141
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/data/otter_mimicit_vst_dataset.py
@@ -0,0 +1,138 @@
+# This dataset is from https://huggingface.co/datasets/pufanyi/MIMICIT
+import os
+import torch
+import json
+import base64
+import random
+from tqdm import tqdm
+from PIL import Image
+from io import BytesIO
+from .vqa_dataset import VQADataset
+from utils.utils import print_rank_0, is_rank_0, get_rank
+from .utils import save_debug_image, save_debug_text
+
+
+class OtterMimicitVstDataset(VQADataset):
+ def __init__(self, data_path, data_debug_path, per_sample_image, max_num_image_per_sample, tokenizer, vis_processor, **kwargs):
+ vis_root = f"{data_path}/MIMIC-IT/VST.json"
+ assert os.path.isfile(vis_root), f"OtterMimicitVstDataset image data {vis_root} not found, you need to download it from https://huggingface.co/datasets/pufanyi/MIMICIT"
+ self.vis_root_dict = json.load(open(vis_root, "r"))
+ self.max_num_image_per_sample = max_num_image_per_sample
+
+ ann_paths_raw = ["MIMIC-IT/VST_instructions.json"]
+ ann_paths = [f"MIMIC-IT/VST_instructions_merged_filtered{max_num_image_per_sample}.json"]
+ for idx in range(len(ann_paths)):
+ ann_paths_raw[idx] = f"{data_path}/{ann_paths_raw[idx]}"
+ ann_paths[idx] = f"{data_path}/{ann_paths[idx]}"
+ assert os.path.isfile(ann_paths_raw[idx]), f"OtterMimicitVstDataset raw annotation file {ann_paths_raw[idx]} not found, you need to download it from https://huggingface.co/datasets/pufanyi/MIMICIT"
+ if not os.path.isfile(ann_paths[idx]):
+                print_rank_0(f"OtterMimicitVstDataset annotation file {ann_paths[idx]} not found, starting a one-time preprocessing:")
+ if is_rank_0():
+ raw_annotation = json.load(open(ann_paths_raw[idx], "r"))["data"]
+ raw_annotation_keys = list(raw_annotation.keys())
+ random.shuffle(raw_annotation_keys)
+ annotations = []
+ for k in tqdm(raw_annotation_keys):
+ if k in raw_annotation:
+ ann = {}
+ ann["image_ids"] = []
+ for image in raw_annotation[k]["image_ids"]:
+ if image in self.vis_root_dict:
+ ann["image_ids"].append(image)
+ if len(ann["image_ids"]) > 0 and len(ann["image_ids"]) <= max_num_image_per_sample:
+ ann["instruction"] = [raw_annotation[k]["instruction"]]
+ ann["answer"] = [raw_annotation[k]["answer"]]
+ rel_ins_ids = raw_annotation[k]["rel_ins_ids"]
+ for k_rel in rel_ins_ids:
+ if k_rel in raw_annotation:
+ ann["instruction"].append(raw_annotation[k_rel]["instruction"])
+ ann["answer"].append(raw_annotation[k_rel]["answer"])
+ del raw_annotation[k_rel]
+ annotations.append(ann)
+ del raw_annotation[k]
+ with open(ann_paths[idx], 'w') as f:
+ json.dump(annotations, f)
+ torch.distributed.barrier()
+ super().__init__(data_path, data_debug_path, per_sample_image, tokenizer, vis_processor,
+ vis_root, ann_paths, **kwargs)
+
+ def _add_instance_ids(self, key="id"):
+ for idx, ann in enumerate(self.annotation):
+ ann[key] = str(idx)
+
+ def process_image(self, ann, data_debug_path=None, data_debug_counter=0):
+ images = ann["image_ids"]
+ chosen = list(range(len(images)))
+ if len(images) > self.max_num_image_per_sample:
+ chosen = list(sorted(random.sample(chosen, self.max_num_image_per_sample)))
+ output_images = []
+ for idx in chosen:
+ image = images[idx]
+ image_base64 = base64.b64decode(self.vis_root_dict[image])
+ save_debug_image(image_base64, data_debug_path, data_debug_counter,
+ get_rank(), img_idx=idx, base64=True)
+ image = Image.open(BytesIO(image_base64)).convert("RGB")
+
+ image = self.vis_processor(image)
+ try:
+ image = image['pixel_values'][0]
+ except:
+ image = image
+ output_images.append(image)
+
+ return output_images
+
+ def process_text(self, ann, data_debug_path=None, data_debug_counter=0, first_message=False, num_images=1):
+ question_list = ann["instruction"]
+ answer_list = ann["answer"]
+ num_convs = len(question_list)
+ indexes = list(range(num_convs))
+ random.shuffle(indexes)
+ conv_list = []
+ for conv_id in range(num_convs):
+ question = question_list[indexes[conv_id]]
+            # remove '<image>' tag and '\n'
+            question = question.replace("<image>", "").replace("\n", "")
+ answer = answer_list[indexes[conv_id]]
+ instruction = self.prompter(question, with_image=(conv_id == 0 and first_message),
+ first_message=(conv_id == 0 and first_message),
+ num_images=num_images)
+ if conv_id == 0 and first_message:
+ instruction = self.post_process_text_image_count(instruction, num_images)
+ single_conv = dict(instruction=instruction, answer=answer)
+ conv_list.append(single_conv)
+ save_debug_text(conv_list, data_debug_path, data_debug_counter, get_rank())
+ return conv_list
+
+ def __getitem__(self, index):
+ ann = self.annotation[index][0] # self.annotation[index] is a list because of "self.annotation = DST.random_grouping(self.annotation, self.per_sample_image)" in VQADataset init
+ images_list = self.process_image(ann,
+ data_debug_path=self.data_debug_path,
+ data_debug_counter=self.data_debug_counter)
+ text_list = self.process_text(ann,
+ data_debug_path=self.data_debug_path,
+ data_debug_counter=self.data_debug_counter,
+ first_message=True,
+ num_images=len(images_list))
+
+ self.data_debug_counter += 1
+ res_list = []
+ for text in text_list:
+ single_res = self.tokenize(text)
+ res_list.append(single_res)
+
+ input_ids = []
+ attention_mask = []
+ labels = []
+ for res in res_list:
+ input_ids.extend(res["input_ids"])
+ attention_mask.extend(res["attention_mask"])
+ labels.extend(res["labels"])
+
+ res = dict(
+ input_ids=input_ids, attention_mask=attention_mask, labels=labels
+ )
+ res.update(image=images_list)
+ res.update(image_num=len(images_list))
+
+ return res
diff --git a/applications/DeepSpeed-VisualChat/utils/data/sparkles_dialogue_dataset.py b/applications/DeepSpeed-VisualChat/utils/data/sparkles_dialogue_dataset.py
new file mode 100644
index 000000000..d11fcfa97
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/data/sparkles_dialogue_dataset.py
@@ -0,0 +1,161 @@
+# This dataset is from https://github.com/HYPJUDY/Sparkles
+import os
+import torch
+import json
+import random
+import re
+from PIL import Image
+from .vqa_dataset import VQADataset
+from utils.utils import print_rank_0, is_rank_0, get_rank
+from .utils import save_debug_image, save_debug_text
+
+
+class SparklesDialogueDataset(VQADataset):
+ def __init__(self, data_path, data_debug_path, per_sample_image, tokenizer, vis_processor, **kwargs):
+ vis_root = ["SparklesDialogueCC/images", "SparklesDialogueVG/images"]
+ for idx in range(len(vis_root)):
+ vis_root[idx] = f"{data_path}/{vis_root[idx]}"
+ assert os.path.isdir(vis_root[idx]), f"SparklesDialogueDataset image directory {vis_root[idx]} not found, you need to download it from https://github.com/HYPJUDY/Sparkles"
+
+ ann_path_raw = ["SparklesDialogueCC/annotations/SparklesDialogueCC.json",
+ "SparklesDialogueVG/annotations/SparklesDialogueVG.json"]
+ for idx in range(len(ann_path_raw)):
+ ann_path_raw[idx] = f"{data_path}/{ann_path_raw[idx]}"
+ assert os.path.isfile(ann_path_raw[idx]), f"SparklesDialogueDataset annotation file {ann_path_raw[idx]} not found, you need to download it from https://github.com/HYPJUDY/Sparkles"
+ ann_path = f"{data_path}/SparklesDialogue.json"
+
+ if not os.path.isfile(ann_path):
+            print_rank_0(f"SparklesDialogueDataset: starting a one-time preprocessing:")
+ if is_rank_0():
+ annotations = []
+ for a_idx in range(len(ann_path_raw)):
+ raw_annotation = json.load(open(ann_path_raw[a_idx], "r"))
+ for raw_ann in raw_annotation:
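+                    # Keep a dialogue only if it strictly alternates user/assistant turns, every referenced image exists on disk, and it uses at most 8 images.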
+ meet_criteria = True
+ if len(raw_ann["dialogue"]) % 2 != 0:
+ meet_criteria = False
+ raw_ann["image_path"] = vis_root[a_idx]
+ num_img = 0
+ for d_idx in range(len(raw_ann["dialogue"])):
+ if d_idx % 2 == 0 and raw_ann["dialogue"][d_idx]["role"] != "user":
+ meet_criteria = False
+ if d_idx % 2 == 1 and raw_ann["dialogue"][d_idx]["role"] != "assistant":
+ meet_criteria = False
+ if "images" in raw_ann["dialogue"][d_idx]:
+ for img in raw_ann["dialogue"][d_idx]["images"]:
+ img_id = img["image_id"]
+ num_img += 1
+ if not os.path.isfile(f"{vis_root[a_idx]}/{img_id}.jpg"):
+ meet_criteria = False
+ if num_img > 8: # Currently only use conversations with <= 8 images
+ meet_criteria = False
+ if meet_criteria:
+ annotations.append(raw_ann)
+ with open(ann_path, 'w') as f:
+ json.dump(annotations, f)
+ torch.distributed.barrier()
+ super().__init__(data_path, data_debug_path, per_sample_image, tokenizer, vis_processor,
+ vis_root, [ann_path], **kwargs)
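+        # Natural-language image tags; process_text picks one tagging style at random and substitutes it for the raw IMAGE#<id> references.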
+ self.image_tag_dict = [{0: "image a", 1: "image b", 2: "image c", 3: "image d", 4: "image e", 5: "image f", 6: "image g", 7: "image h"},
+ {0: "image A", 1: "image B", 2: "image C", 3: "image D", 4: "image E", 5: "image F", 6: "image G", 7: "image H"},
+ {0: "the first image", 1: "the second image", 2: "the third image", 3: "the fourth image",
+ 4: "the fifth image", 5: "the sixth image", 6: "the seventh image", 7: "the eighth image"}]
+
+ def _add_instance_ids(self, key="id"):
+ for idx, ann in enumerate(self.annotation):
+ ann[key] = str(idx)
+
+ def process_image(self, ann, data_debug_path=None, data_debug_counter=0):
+ output_images = []
+ img_counter = 0
+ for dialogue in ann["dialogue"]:
+ if "images" in dialogue:
+ for img in dialogue["images"]:
+ image_path = os.path.join(ann["image_path"], str(img["image_id"]) + ".jpg")
+ save_debug_image(image_path, data_debug_path, data_debug_counter,
+ get_rank(), img_idx=img_counter)
+ img_counter += 1
+ image = Image.open(image_path).convert("RGB")
+
+ image = self.vis_processor(image)
+ try:
+ image = image['pixel_values'][0]
+ except:
+ image = image
+ output_images.append(image)
+
+ return output_images
+
+ def process_text(self, ann, data_debug_path=None, data_debug_counter=0, first_message=False, num_images=1):
+ tag_dict = random.choice(self.image_tag_dict)
+ regex = re.compile(r'((?<=[\.\?!]\s)(\w+)|(^\w+))')
+ def capitalize_sentence(match):
+ return(match.group().capitalize())
+ to_replace = []
+ conv_list = []
+ num_convs = len(ann["dialogue"]) // 2
+ tot_num_image = 0
+ for conv_id in range(num_convs):
+ with_image = False
+ num_image = 0
+ if "images" in ann["dialogue"][int(2*conv_id)]:
+ with_image = True
+ for img in ann["dialogue"][int(2*conv_id)]["images"]:
+ img_id = img["image_id"]
+ tag_replace = [f"IMAGE#{img_id}", tag_dict[len(to_replace)]]
+ to_replace.append(tag_replace)
+ num_image += 1
+ question = ann["dialogue"][int(2*conv_id)]["content"]
+ # remove '' tag and '\n'
+ question = question.replace("", "").replace("\n", "")
+ answer = ann["dialogue"][int(2*conv_id+1)]["content"]
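+            # Two-pass replacement via %tempN% markers: map each IMAGE#<id> to a placeholder first, then to its tag, so substitutions do not interfere with one another.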
+ for idx in range(len(to_replace)):
+ question = question.replace(to_replace[idx][0], f"%temp{idx}%")
+ answer = answer.replace(to_replace[idx][0], f"%temp{idx}%")
+ for idx in range(len(to_replace)):
+ question = question.replace(f"%temp{idx}%", to_replace[idx][1])
+ answer = answer.replace(f"%temp{idx}%", to_replace[idx][1])
+ question = regex.sub(capitalize_sentence, question)
+ answer = regex.sub(capitalize_sentence, answer)
+ instruction = self.prompter(question, with_image=with_image, first_message=(len(conv_list) == 0 and first_message), num_images=num_image)
+ if with_image:
+ instruction = self.post_process_text_image_count(instruction, num_image, offset=tot_num_image)
+ single_conv = dict(instruction=instruction, answer=answer)
+ conv_list.append(single_conv)
+ tot_num_image += num_image
+
+ save_debug_text(conv_list, data_debug_path, data_debug_counter, get_rank())
+ return conv_list
+
+ def __getitem__(self, index):
+ ann = self.annotation[index][0] # self.annotation[index] is a list because of "self.annotation = DST.random_grouping(self.annotation, self.per_sample_image)" in VQADataset init
+ images_list = self.process_image(ann,
+ data_debug_path=self.data_debug_path,
+ data_debug_counter=self.data_debug_counter)
+ text_list = self.process_text(ann,
+ data_debug_path=self.data_debug_path,
+ data_debug_counter=self.data_debug_counter,
+ first_message=True,
+ num_images=len(images_list))
+
+ self.data_debug_counter += 1
+ res_list = []
+ for text in text_list:
+ single_res = self.tokenize(text)
+ res_list.append(single_res)
+
+ input_ids = []
+ attention_mask = []
+ labels = []
+ for res in res_list:
+ input_ids.extend(res["input_ids"])
+ attention_mask.extend(res["attention_mask"])
+ labels.extend(res["labels"])
+
+ res = dict(
+ input_ids=input_ids, attention_mask=attention_mask, labels=labels
+ )
+ res.update(image=images_list)
+ res.update(image_num=len(images_list))
+
+ return res
diff --git a/applications/DeepSpeed-VisualChat/utils/data/utils.py b/applications/DeepSpeed-VisualChat/utils/data/utils.py
new file mode 100644
index 000000000..e63b94173
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/data/utils.py
@@ -0,0 +1,64 @@
+import torch
+from torch.utils.data import Subset
+from torch.nn.utils.rnn import pad_sequence
+import numpy as np
+import shutil
+from torch.utils.data.dataloader import default_collate
+import utils.data.DST as DST
+
+NUM_DEBUG_SAMPLE = 10
+
+def split_dataset(dataset, split_ratio=0.8):
+ split = int(len(dataset) * split_ratio)
+ return Subset(dataset, range(split)), Subset(dataset, range(split, len(dataset)))
+
+def shuffle_dataset(dataset, np_rng):
+ size = len(dataset)
+ dtype_ = np.uint32
+ if size >= (np.iinfo(np.uint32).max - 1):
+ dtype_ = np.int64
+ shuffle_idx = np.arange(start=0, stop=size, step=1, dtype=dtype_)
+ np_rng.shuffle(shuffle_idx)
+ return Subset(dataset, shuffle_idx.tolist())
+
+def save_debug_image(image_path, data_debug_path, data_debug_counter, rank, img_idx=0, base64=False):
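+    # When base64=True, image_path holds the already-decoded image bytes rather than a path on disk.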
+ if data_debug_path is not None and data_debug_counter < NUM_DEBUG_SAMPLE:
+ if base64:
+ with open(f"{data_debug_path}/gpu_rank{rank}_debug{data_debug_counter}_image{img_idx}.jpg", 'wb') as f:
+ f.write(image_path)
+ else:
+ shutil.copyfile(
+ image_path,
+ f"{data_debug_path}/gpu_rank{rank}_debug{data_debug_counter}_image{img_idx}.jpg")
+
+def save_debug_text(text_to_save, data_debug_path, data_debug_counter, rank):
+ if data_debug_path is not None and data_debug_counter < NUM_DEBUG_SAMPLE:
+ with open(f"{data_debug_path}/gpu_rank{rank}_debug{data_debug_counter}_text.txt", 'w') as f:
+ f.write(f"{text_to_save}")
+
+class DataCollatorPadToMaxLen:
+
+ def __init__(self, max_token_len, pad_token_id):
+ self.max_token_len = max_token_len
+ self.pad_token_id = pad_token_id
+
+ def __call__(self, data):
+ batch = {}
+ input_ids = pad_sequence([default_collate(f['input_ids']) for f in data],
+ padding_value=self.pad_token_id,
+ batch_first=True)
+
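+        # Labels are padded with the label padding value so that padded positions are ignored by the loss.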
+ labels = pad_sequence([default_collate(f['labels']) for f in data],
+ padding_value=DST.DEFAULT_LABEL_PADDING_NUM,
+ batch_first=True)
+ attention_mask = pad_sequence([default_collate(f['attention_mask']) for f in data],
+ padding_value=0,
+ batch_first=True)
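+        # Flatten the images of every sample into a single (total_images, C, H, W) tensor; per-sample counts are kept in image_num.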
+ image = torch.concat([default_collate(f['image']) for f in data], dim=0).reshape((-1,) + data[0]["image"][0].shape[-3:])
+ image_num = [f['image_num'] for f in data]
+ batch['input_ids'] = input_ids
+ batch['labels'] = labels
+ batch['attention_mask'] = attention_mask
+ batch['image'] = image
+ batch['image_num'] = image_num
+ return batch
\ No newline at end of file
diff --git a/applications/DeepSpeed-VisualChat/utils/data/vqa_dataset.py b/applications/DeepSpeed-VisualChat/utils/data/vqa_dataset.py
new file mode 100755
index 000000000..f35b01642
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/data/vqa_dataset.py
@@ -0,0 +1,294 @@
+# This file is adapted from https://github.com/open-mmlab/Multimodal-GPT
+
+"""
+ Copyright (c) 2022, salesforce.com, inc.
+ All rights reserved.
+ SPDX-License-Identifier: BSD-3-Clause
+ For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+"""
+
+import copy
+import json
+import os
+import random
+from collections import defaultdict
+from typing import Iterable
+
+import numpy as np
+import torch
+from PIL import Image
+from torch.utils.data import ConcatDataset, Dataset
+from transformers import LlamaTokenizer
+import utils.data.DST as DST
+from utils.utils import get_rank
+from .utils import save_debug_image, save_debug_text
+import re
+
+class VQADataset(Dataset):
+ def __init__(
+ self,
+ data_path,
+ data_debug_path,
+ per_sample_image,
+ tokenizer,
+ vis_processor=None,
+ vis_root=None,
+ ann_paths=[],
+ add_eos=True,
+ ignore_instruction=True,
+ sample_image=False,
+ annotation_key=None
+ ):
+ """
+ vis_root (string): Root directory of images (e.g. coco/images/)
+        ann_paths (list of string): paths to the annotation files
+ """
+ if hasattr(tokenizer, "add_eos_token"):
+ assert tokenizer.add_eos_token is False, "tokenizer should not add eos token by default"
+ self.tokenizer: LlamaTokenizer = tokenizer
+ self.data_path = data_path
+ self.data_debug_path = data_debug_path
+ self.data_debug_counter = 0
+ self.vis_root = vis_root
+ self.per_sample_image = per_sample_image
+ print('check tokenizer', self.tokenizer)
+ self.annotation = []
+ for ann_path in ann_paths:
+ if annotation_key is None:
+ self.annotation.extend(json.load(open(ann_path, "r")))
+ else:
+ self.annotation.extend(json.load(open(ann_path, "r"))[annotation_key])
+ self.sample_image = sample_image
+ if self.sample_image:
+ print("randomly sample one annotation for each image")
+ self.annotation = self.parse_annotation(self.annotation)
+
+ self.annotation = DST.random_grouping(self.annotation, self.per_sample_image)
+
+ self.vis_processor = vis_processor
+
+ self.option_prob = 0.5
+ self.prompter = DST.Prompter()
+ self.add_eos = add_eos
+ self.ignore_instruction = ignore_instruction
+ self.system_instruct = None
+ self.image_token_dict = DST.get_image_num_map(self.tokenizer)
+ self.cat_number()
+
+ def parse_annotation(self, annotation):
+ image_list = defaultdict(list)
+ for ann in annotation:
+ image_list[ann["image"]].append(ann)
+
+ annotation = []
+ for ann_list in image_list.values():
+ annotation.append(random.choice(ann_list))
+
+ return annotation
+
+ def __len__(self):
+ return len(self.annotation)
+
+ def cat_number(self):
+ tmp = len(self.annotation) // self.per_sample_image
+ self.arithmetic_progression_multi_image = [tmp * i for i in range(self.per_sample_image)]
+
+ def _add_instance_ids(self, key="instance_id"):
+ for idx, ann in enumerate(self.annotation):
+ ann[key] = str(idx)
+
+ def process_image(self, ann, data_debug_path=None, data_debug_counter=0):
+ image_path = os.path.join(self.vis_root, ann["image"])
+ save_debug_image(image_path, data_debug_path, data_debug_counter, get_rank(), img_idx=0)
+ image = Image.open(image_path).convert("RGB")
+
+ image = self.vis_processor(image)
+ try:
+ image = image['pixel_values'][0]
+ return image
+ except:
+ return image
+
+ def post_process_text_image_count(self, text, image_num, offset=0):
+ for i in range(1+offset, image_num+1+offset):
+ text = re.sub(DST.DEFAULT_HUMAN_IMAGE_PRETOKEN, DST.image_mapping_dict[f"{i}"], text, count=1)
+ return text
+
+ def process_text(self, ann, data_debug_path=None, data_debug_counter=0, first_message=False):
+ question = ann["question"]
+
+ answer_weight = {}
+ for answer in ann["answer"]:
+ if answer in answer_weight.keys():
+ answer_weight[answer] += 1 / len(ann["answer"])
+ else:
+ answer_weight[answer] = 1 / len(ann["answer"])
+
+ answers = list(answer_weight.keys())
+ weights = list(answer_weight.values())
+
+ # create instruction
+ true_answer = answers[np.argmax(weights)]
+ is_option = random.random() < self.option_prob and len(answers) > 1
+ if is_option:
+ instruction = self.prompter(question, answers)
+ else:
+ instruction = self.prompter(question, with_image=True, first_message=first_message)
+ save_debug_text([instruction, true_answer], data_debug_path, data_debug_counter, get_rank())
+ return dict(instruction=instruction, answer=true_answer)
+
+ def tokenize(self, text):
+ res = self.tokenizer(
+ text["instruction"] + text["answer"],
+ return_tensors=None,
+ padding="do_not_pad",
+ truncation=True,
+ max_length=512,
+ )
+ if res["input_ids"][-1] != self.tokenizer.eos_token_id and self.add_eos:
+ res["input_ids"].append(self.tokenizer.eos_token_id)
+ res["attention_mask"].append(1)
+
+ labels = copy.deepcopy(res["input_ids"])
+ # ignore instruction_token
+ if self.ignore_instruction:
+ instruction_token = self.tokenizer(
+ text["instruction"], return_tensors=None, padding="do_not_pad", truncation=True, max_length=512
+ )
+ labels = [DST.DEFAULT_LABEL_PADDING_NUM] * len(instruction_token["input_ids"]) + labels[len(instruction_token["input_ids"]) :]
+
+ res.update(labels=labels)
+ return res
+
+
+ def create_system_instruct(self):
+ system_instruct = self.tokenizer(
+ DST.DEFAULT_PROMPT,
+ return_tensors=None,
+ padding="do_not_pad",
+ truncation=False,
+ )
+ # create the system instruction
+ self.system_instruct = {
+ "input_ids": system_instruct["input_ids"] + [self.tokenizer.eos_token_id],
+ "attention_mask": system_instruct["attention_mask"] + [1],
+ "labels": (len(system_instruct["input_ids"]) + 1) * [DST.DEFAULT_LABEL_PADDING_NUM],
+ }
+
+ def merge_all_images(self, res_list):
+ def find_index_and_replace(input_list, attention_mask_list, labels_list, image_number):
+ # replace a single number with a list of numbers
+ index = input_list.index(self.image_token_dict[DST.DEFAULT_HUMAN_IMAGE_PRETOKEN])
+ input_list[index] = self.image_token_dict[DST.image_mapping_dict[str(image_number)]]
+ attention_mask_list[index] = [1] * len(self.image_token_dict[DST.image_mapping_dict[str(image_number)]])
+ labels_list[index] = [DST.DEFAULT_LABEL_PADDING_NUM] * len(self.image_token_dict[DST.image_mapping_dict[str(image_number)]])
+ # flatten nested list
+ input_list = DST.flatten(input_list)
+ attention_mask_list = DST.flatten(attention_mask_list)
+ labels_list = DST.flatten(labels_list)
+ return input_list, attention_mask_list, labels_list
+ image_number = 0
+ original_output = {"input_ids": [], "attention_mask": [], "labels": [], "image": []} #copy.deepcopy(self.system_instruct)
+ # original_output["image"] = []
+ for res in res_list:
+ # need to check if it has image or not
+ if self.image_token_dict[DST.DEFAULT_HUMAN_IMAGE_PRETOKEN] in res["input_ids"]:
+ image_number += 1
+ res["input_ids"], res["attention_mask"], res["labels"] = find_index_and_replace(res["input_ids"], res["attention_mask"], res["labels"], image_number)
+ original_output["image"] = original_output["image"] + [res["image"]]
+ # cat res to original_output
+ original_output["input_ids"] = original_output["input_ids"] + res["input_ids"]
+ original_output["attention_mask"] = original_output["attention_mask"] + res["attention_mask"]
+ original_output["labels"] = original_output["labels"] + res["labels"]
+ if image_number == 0:
+            raise ValueError("image number should not be zero; the no-image case is not supported yet.")
+ original_output["image_num"] = image_number
+ return original_output
+
+ def __getitem__(self, index):
+ res_list = []
+ for ann in self.annotation[index]:
+ image = self.process_image(ann,
+ data_debug_path=self.data_debug_path,
+ data_debug_counter=self.data_debug_counter)
+ text = self.process_text(ann,
+ data_debug_path=self.data_debug_path,
+ data_debug_counter=self.data_debug_counter,
+ first_message=(not res_list))
+ self.data_debug_counter += 1
+ res = self.tokenize(text)
+ res.update(image=image)
+ res.update(text)
+ res_list.append(res)
+
+ output = self.merge_all_images(res_list)
+ return output
+
+ def collater(self, samples):
+ image_list, question_list, answer_list, input_id_list, attention_mask_list, labels_list = [], [], [], [], [], []
+
+ for sample in samples:
+ image_list.append(sample["image"])
+ question_list.append(sample["instruction"])
+ answer_list.append(sample["answer"])
+ input_id_list.append(sample["input_ids"])
+ attention_mask_list.append(sample["attention_mask"])
+ labels_list.append(sample["labels"])
+
+ # We have to pad the labels before calling `tokenizer.pad` as this method won't pad them and needs them of the
+ # same length to return tensors.
+ max_label_length = max(len(l) for l in labels_list)
+ padding_side = self.tokenizer.padding_side
+ padded_labels = []
+ for l in labels_list:
+ remainder = [DST.DEFAULT_LABEL_PADDING_NUM] * (max_label_length - len(l))
+ if isinstance(l, list):
+ l = l + remainder if padding_side == "right" else remainder + l
+ elif padding_side == "right":
+ l = np.concatenate([l, remainder]).astype(np.int64)
+ else:
+ l = np.concatenate([remainder, l]).astype(np.int64)
+ padded_labels.append(l)
+
+ padded_samples = self.tokenizer.pad(
+ {"input_ids": input_id_list, "attention_mask": attention_mask_list, "labels": padded_labels},
+ return_tensors="pt",
+ padding="longest",
+ )
+
+        # mask out padding and image-related tokens in the labels
+ labels = padded_samples["labels"]
+ labels[labels == self.tokenizer.pad_token_id] = DST.DEFAULT_LABEL_PADDING_NUM
+ labels[:, 0] = DST.DEFAULT_LABEL_PADDING_NUM
+ for k, v in self.image_token_dict.items():
+ labels[labels == v] = DST.DEFAULT_LABEL_PADDING_NUM
+ return {
+ "image": torch.stack(image_list, dim=0),
+ "input_ids": padded_samples["input_ids"],
+ "attention_mask": padded_samples["attention_mask"],
+ "labels": labels,
+ "instruction": question_list,
+ "answer": answer_list,
+ }
+
+
+class ConcatDataset(ConcatDataset):
+ def __init__(self, datasets: Iterable[Dataset]) -> None:
+ super().__init__(datasets)
+
+ def collater(self, samples):
+ # TODO For now only supports datasets with same underlying collater implementations
+
+ all_keys = set()
+ for s in samples:
+ all_keys.update(s)
+
+ shared_keys = all_keys
+ for s in samples:
+ shared_keys = shared_keys & set(s.keys())
+
+ samples_shared_keys = []
+ for s in samples:
+ samples_shared_keys.append({k: s[k] for k in s.keys() if k in shared_keys})
+
+ return self.datasets[0].collater(samples_shared_keys)
diff --git a/applications/DeepSpeed-VisualChat/utils/ds_utils.py b/applications/DeepSpeed-VisualChat/utils/ds_utils.py
new file mode 100755
index 000000000..7e04d9bb3
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/ds_utils.py
@@ -0,0 +1,97 @@
+# Copyright (c) Microsoft Corporation.
+# SPDX-License-Identifier: Apache-2.0
+
+# DeepSpeed Team
+GLOBAL_BATCH_SIZE = 32
+MICRO_BATCH_SIZE = 4
+
+
+def get_train_ds_config(args,
+ offload,
+ stage=2,
+ enable_hybrid_engine=False,
+ inference_tp_size=1,
+ release_inference_cache=False,
+ pin_parameters=True,
+ tp_gather_partition_size=8,
+ max_out_tokens=512):
+ if args.precision == 'fp16':
+ enable_fp16 = True
+ enable_bf16 = False
+ elif args.precision == 'bf16':
+ enable_fp16 = False
+ enable_bf16 = True
+ else:
+ raise ValueError(f"Invalid precision {args.precision}")
+ device = "cpu" if offload else "none"
+ zero_opt_dict = {
+ "stage": stage,
+ "offload_param": {
+ "device": device
+ },
+ "offload_optimizer": {
+ "device": device
+ },
+ "stage3_param_persistence_threshold": 1e4,
+ "stage3_max_live_parameters": 3e7,
+ "stage3_prefetch_bucket_size": 0,
+ "memory_efficient_linear": False,
+ }
+ output = {
+ "train_batch_size": GLOBAL_BATCH_SIZE,
+ "train_micro_batch_size_per_gpu": MICRO_BATCH_SIZE,
+ "steps_per_print": 10,
+ "zero_optimization": zero_opt_dict,
+ "zero_allow_untested_optimizer": True,
+ "zero_force_ds_cpu_optimizer": False,
+ "fp16": {
+ "enabled": enable_fp16,
+ "loss_scale_window": 100
+ },
+ "bf16": {
+ "enabled": enable_bf16,
+ },
+ "gradient_clipping": 1.0,
+ "prescale_gradients": False,
+ "wall_clock_breakdown": False,
+ "hybrid_engine": {
+ "enabled": enable_hybrid_engine,
+ "max_out_tokens": max_out_tokens,
+ "inference_tp_size": inference_tp_size,
+ "release_inference_cache": release_inference_cache,
+ "pin_parameters": pin_parameters,
+ "tp_gather_partition_size": tp_gather_partition_size,
+ },
+
+ }
+ if args.enable_tensorboard:
+ output.update({"tensorboard": {
+ "enabled": True,
+ "output_path": args.output_dir,
+ "job_name": 'tb_logging'
+ }}
+ )
+ return output
+
+def get_eval_ds_config(offload, stage=0):
+ device = "cpu" if offload else "none"
+ zero_opt_dict = {
+ "stage": stage,
+ "stage3_param_persistence_threshold": 1e4,
+ "offload_param": {
+ "device": device
+ },
+ "memory_efficient_linear": False
+ }
+ return {
+ "train_batch_size": GLOBAL_BATCH_SIZE,
+ "train_micro_batch_size_per_gpu": MICRO_BATCH_SIZE,
+ "steps_per_print": 10,
+ "zero_optimization": zero_opt_dict,
+ "fp16": {
+ "enabled": True
+ },
+ "gradient_clipping": 1.0,
+ "prescale_gradients": False,
+ "wall_clock_breakdown": False
+ }
diff --git a/applications/DeepSpeed-VisualChat/utils/model/__init__.py b/applications/DeepSpeed-VisualChat/utils/model/__init__.py
new file mode 100644
index 000000000..62685308b
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/model/__init__.py
@@ -0,0 +1 @@
+from .modeling_dsvl import create_dsvl_model_and_transforms
\ No newline at end of file
diff --git a/applications/DeepSpeed-VisualChat/utils/model/modeling_dsvl.py b/applications/DeepSpeed-VisualChat/utils/model/modeling_dsvl.py
new file mode 100755
index 000000000..eb9db9428
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/model/modeling_dsvl.py
@@ -0,0 +1,377 @@
+import torch
+import torch.nn.functional as F
+from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM
+from transformers import CLIPVisionModel, CLIPImageProcessor
+from .third_party_model.hf_model.modeling_llama import LlamaForCausalLM
+from .third_party_model.hf_model.configuration_llama import LlamaConfig
+from .third_party_model.qwen_clip.qwen_clip import VisionTransformer
+from torch import nn
+from torch.nn import CrossEntropyLoss
+import copy
+import os
+import sys
+from ..data import build_dataset, DataCollatorPadToMaxLen, add_special_token
+sys.path.append(
+ os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir)))
+import data.DST as DST # default special tokens
+from torch.utils.data import DataLoader
+from transformers.deepspeed import HfDeepSpeedConfig
+import numpy as np
+from .vis_proj import VisProjection_vit, VisProjection_perceiver
+
+def get_name(huggingface_path):
+ if 'opt' in huggingface_path.lower():
+ return 'opt'
+ elif 'gpt2' in huggingface_path.lower():
+ return 'gpt2'
+ elif 'llama-2' in huggingface_path.lower():
+ return 'llama-2'
+ else:
+        raise ValueError('We currently only support LLaMA-2, OPT and GPT2')
+
+def create_dsvl_model_and_transforms(
+ text_tokenizer=None,
+ ds_config=None,
+ args=None):
+ assert args.vision_model_name_or_path is not None
+ assert args.lm_model_name_or_path is not None
+ if ds_config is not None and ds_config["zero_optimization"]["stage"] == 3:
+ # https://huggingface.co/docs/transformers/main_classes/deepspeed#nontrainer-deepspeed-integration
+ dschf = HfDeepSpeedConfig(ds_config)
+ lang_config = AutoConfig.from_pretrained(args.lm_model_name_or_path)
+
+
+ if 'qwen' in args.vision_model_name_or_path.lower():
+        # use a stand-in CLIP vision config so the rest of the code sees a consistent interface
+ vis_config = AutoConfig.from_pretrained("laion/CLIP-ViT-bigG-14-laion2B-39B-b160k")
+ vis_config = vis_config.vision_config
+ vis_encoder = VisionTransformer(
+ image_size=448,
+ patch_size=vis_config.patch_size,
+ width=vis_config.hidden_size,
+ layers=vis_config.num_hidden_layers,
+ heads=vis_config.num_attention_heads,
+ mlp_size=vis_config.intermediate_size,
+ output_dim=4096,
+ )
+ vis_encoder.load_state_dict(torch.load(os.path.join(args.vision_model_name_or_path, 'pytorch_model.bin'), map_location='cpu'), strict=True)
+        vis_config.hidden_size = 4096 # match the 4096-d output of the QWen-VL vision encoder so the projection layer is sized correctly
+ elif 'clip' in args.vision_model_name_or_path.lower():
+ vis_encoder = CLIPVisionModel.from_pretrained(args.vision_model_name_or_path)
+ vis_config = vis_encoder.config
+ else:
+        raise ValueError("We currently only support QWen's modified CLIP and other CLIP models")
+
+ image_processor = CLIPImageProcessor.from_pretrained(args.vision_model_name_or_path)
+
+ tokenizer = add_special_token(text_tokenizer)
+ tokenizer.pad_token = tokenizer.eos_token
+ if 'llama' in args.lm_model_name_or_path.lower():
+ lang_config = LlamaConfig.from_pretrained(args.lm_model_name_or_path)
+ lang_config.enable_mmca_attention = args.enable_mmca_attention
+ lang_config.max_position_embeddings = args.max_seq_len
+
+ if 'llama' in args.lm_model_name_or_path.lower():
+ if ds_config is not None and ds_config["zero_optimization"]["stage"] == 3:
+ lang_decoder = LlamaForCausalLM.from_pretrained(args.lm_model_name_or_path, config=lang_config)
+ else:
+ try:
+ device = torch.device("cuda", args.local_rank)
+ except:
+ device = "auto"
+ lang_decoder = LlamaForCausalLM.from_pretrained(args.lm_model_name_or_path, config=lang_config, device_map=device)
+ decoder_name = 'llama'
+ else:
+        raise NotImplementedError("We currently only support the LLaMA family; other language models are not supported yet")
+
+ lang_config.vocab_size = len(tokenizer)
+ lang_decoder.resize_token_embeddings(len(tokenizer))
+ model = DeepSpeedViLModel(vis_encoder, lang_decoder, \
+ tokenizer, \
+ vis_config=vis_config, \
+ decoder_name=decoder_name, \
+ lang_config=lang_config, \
+ max_seq_length=args.max_seq_len,
+ args=args)
+
+ return model, image_processor, tokenizer
+
+
+class DeepSpeedViLModel(nn.Module):
+ def __init__(self, vis_encoder,
+ lang_decoder,
+ tokenizer,
+ vis_config=None,
+ decoder_name='gpt2',
+ lang_config=None,
+ max_seq_length=512,
+ args=None):
+ super().__init__()
+ self.vis_encoder = vis_encoder
+
+ self.lang_decoder = lang_decoder
+ self.tokenizer = tokenizer
+ self.args = args
+ self._enable_special_token()
+
+ self.lang_config = lang_config
+ self._get_model_stat(decoder_name)
+ lang_embed, pos_embedding = self._languag_embedding()
+ self.pos_embedding = pos_embedding
+ self.max_seq_length = max_seq_length
+ if lang_embed is None:
+ print ('randomly initialized a language embedding')
+ self.lang_embed = nn.Embedding(self.lang_config.vocab_size,\
+ self.hidden_size,\
+ self.pad_token_id) # randomly initialized language embedder
+ else:
+ self.lang_embed = lang_embed
+
+ self.pos_embedding = pos_embedding
+ self.projection = self.build_projection(vis_config, self.lang_config.hidden_size)
+ self._init_weight()
+
+
+ # get padding token embedding
+ self.padding_embedding = None
+ self.vis_encoder_update = None
+
+ def _enable_special_token(self):
+ self.DEFAULT_IMAGE_TOKEN_ID = self.tokenizer.convert_tokens_to_ids(DST.DEFAULT_IMAGE_TOKEN)
+ self.DEFAULT_IMAGE_PATCH_TOKEN_ID = self.tokenizer.convert_tokens_to_ids(DST.DEFAULT_IMAGE_PATCH_TOKEN)
+ self.DEFAULT_IM_START_TOKEN_ID = self.tokenizer.convert_tokens_to_ids(DST.DEFAULT_IM_START_TOKEN)
+ self.DEFAULT_IM_END_TOKEN_ID = self.tokenizer.convert_tokens_to_ids(DST.DEFAULT_IM_END_TOKEN)
+
+
+ def _get_model_stat(self, model_name):
+ config_dic = {
+ 'llama-2': ['max_position_embeddings','num_hidden_layers'],
+ 'llama': ['max_position_embeddings','num_hidden_layers'],
+ 'gpt2': ['n_positions','n_layer'],
+ 'opt': ['max_position_embeddings','num_hidden_layers']
+ }
+ pos_name, layer_name = config_dic[model_name][0], config_dic[model_name][1]
+ self.n_positions = getattr(self.lang_config, pos_name)
+ self.num_layer = getattr(self.lang_config, layer_name)
+ self.hidden_size = getattr(self.lang_config, 'hidden_size')
+ self.vocab_size = getattr(self.lang_config, 'vocab_size')
+
+ def _languag_embedding(self):
+ pos_embedding = None
+ token_embedding = None
+ for name, module in self.lang_decoder.named_modules():
+ if isinstance(module, nn.Embedding):
+ try:
+ # z3 shape
+ rows = module.weight.ds_shape[0]
+ except:
+ rows = module.weight.size()[0]
+
+ if rows == self.vocab_size:
+ token_embedding = copy.deepcopy(module)
+ if rows == self.n_positions:
+ pos_embedding = copy.deepcopy(module)
+ return token_embedding, pos_embedding
+
+
+ def _init_weight(self):
+ self.vis_encoder.requires_grad_(False)
+ self.lang_decoder.requires_grad_(False)
+ self.lang_embed.requires_grad_(True)
+ self.projection.requires_grad_(True)
+ if self.pos_embedding is not None:
+ self.pos_embedding.requires_grad_(True)
+
+
+ def build_projection(self, vis_config, lang_dim):
+ if self.args.vis_proj == 'vit':
+ output = VisProjection_vit(vis_config, lang_dim=lang_dim)
+ return output
+ elif self.args.vis_proj == 'baseline':
+ return nn.Sequential(
+ nn.Linear(vis_config.hidden_size, lang_dim), # an example implementation
+ nn.LayerNorm(lang_dim, eps=1e-12))
+ elif self.args.vis_proj == 'perceiver':
+ return VisProjection_perceiver(vis_config, lang_dim=lang_dim)
+
+ def concat(self, img_proj, lang, attention_mask, input_labels, image_num, do_generation=False):
+ output_lang = []
+ output_attention_mask = []
+ output_input_labels = []
+
+ def split_tensor_by_a_list(tensor, split_list):
+ output = []
+ initial_pos = 0
+ accumulated_sum = [sum(split_list[:i]) for i in range(1, len(split_list)+1)]
+ for pos in accumulated_sum:
+ output.append(tensor[initial_pos:pos])
+ initial_pos = pos
+ del tensor
+ return output
+
+ img_proj = split_tensor_by_a_list(img_proj, image_num)
+
+        for index in range(len(img_proj)): # each sequence can contain multiple images, so iterate by sequence index
+ initial_pos = 0
+ cur_img = img_proj[index]
+ cur_lang = lang[index]
+ cur_attention_mask = attention_mask[index]
+ cur_input_labels = input_labels[index]
+ img_pos_list = cur_lang.eq(self.DEFAULT_IMAGE_TOKEN_ID).nonzero(as_tuple=True)[0]
+            assert len(img_pos_list) == image_num[index], "the number of image tokens in the text does not match image_num"
+ if len(img_pos_list) == 0:
+                continue # there is no image; it is probably a pure-text instruction
+
+ cur_lang = self.lang_embed(cur_lang) # get the real embedding
+ for img_i, img_pos in zip(cur_img, torch.flip(img_pos_list, dims=(0,))): # do it reversely so that we can easily insert the image
+ lang_pre_img_embed = cur_lang[initial_pos:img_pos]
+ attention_mask_pre_img = cur_attention_mask[initial_pos:img_pos]
+ input_labels_pre_img = cur_input_labels[initial_pos:img_pos]
+
+ lang_post_img_embed = cur_lang[img_pos+1:]
+ attention_mask_post_img = cur_attention_mask[img_pos+1:]
+ input_labels_post_img = cur_input_labels[img_pos+1:]
+ # now we need to concat the image embedding
+ lang_full = torch.cat((lang_pre_img_embed, img_i, lang_post_img_embed), dim=0)
+ # label the position of all images as 2 instead of 1
+
+ attention_mask_full = torch.cat( (attention_mask_pre_img, 2 * torch.ones_like(img_i[:, 0]), attention_mask_post_img), dim=0)
+
+ input_labels_full = torch.cat((input_labels_pre_img.long(), DST.DEFAULT_LABEL_PADDING_NUM * torch.ones_like(img_i[:, 0], dtype=torch.long), input_labels_post_img), dim=0)
+
+ cur_lang = lang_full
+ cur_attention_mask = attention_mask_full
+ cur_input_labels = input_labels_full
+ # append to the output
+ output_lang.append(lang_full.unsqueeze(0))
+ output_attention_mask.append(attention_mask_full.unsqueeze(0))
+ output_input_labels.append(input_labels_full.unsqueeze(0))
+
+ if self.padding_embedding is None:
+ with torch.no_grad():
+ self.padding_embedding = self.lang_embed(torch.tensor(self.tokenizer.pad_token_id).to(lang.device).unsqueeze(0)).unsqueeze(0).detach()
+
+ def pad_tensor_list(tensor_list, pad_token_id, pad_vec=False):
+ max_len = max([tensor.size(1) for tensor in tensor_list])
+ if not do_generation:
+ max_len = int(np.ceil(max_len / 8) * 8) # make it divisible by 8
+ padded_tensor_list = []
+ for tensor in tensor_list:
+ if max_len > tensor.size(1):
+ if pad_vec: # output_lang padding
+ # pad with self.padding_embedding
+ padded_tensor = torch.cat([tensor] + [self.padding_embedding] * (max_len - tensor.size(1)), dim=1)
+
+ else:
+ padded_tensor = F.pad(tensor, (0, max_len - tensor.size(1)), value=pad_token_id)
+ else:
+ padded_tensor = tensor
+ padded_tensor_list.append(padded_tensor)
+ return padded_tensor_list
+ output_lang = pad_tensor_list(output_lang, self.tokenizer.pad_token_id, pad_vec=True)
+ output_attention_mask = pad_tensor_list(output_attention_mask, 0)
+ output_input_labels = pad_tensor_list(output_input_labels, DST.DEFAULT_LABEL_PADDING_NUM)
+
+ return torch.cat(output_lang, dim=0), torch.cat(output_attention_mask, dim=0), torch.cat(output_input_labels, dim=0)
+
+ def forward(self, img, lang,
+ attention_mask=None,
+ input_labels=None,
+ image_num=1,
+ past_key_values=None,
+ use_cache=False,
+ output_attentions=False,
+ output_hidden_states=False,
+ return_dict=True):
+
+ assert attention_mask is not None, "attention mask is required"
+ assert input_labels is not None, "input labels is required"
+
+ if self.vis_encoder_update is None:
+ self.vis_encoder_update = False # default is False
+ for p in self.vis_encoder.parameters():
+ if p.requires_grad:
+ self.vis_encoder_update = True
+ # run the visual encoder with or without gradients depending on whether any of its parameters are trainable
+ if self.vis_encoder_update:
+ # update vis encoder
+ img_feature = self.vis_encoder(img)
+ if not isinstance(img_feature, torch.Tensor):
+ img_feature = img_feature.last_hidden_state
+ else:
+ # do not update vis encoder
+ with torch.no_grad():
+ img_feature = self.vis_encoder(img)
+ if not isinstance(img_feature, torch.Tensor):
+ img_feature = img_feature.last_hidden_state
+ img_proj = self.projection(img_feature)
+
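+ # splice the projected image embeddings into the language embeddings at the image placeholder positions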
+ hidden_states, attention_mask, input_labels = self.concat(img_proj, lang, attention_mask, input_labels, image_num)
+ labels = input_labels
+
+ if self.pos_embedding is not None:
+ if past_key_values is None:
+ past_length = 0
+ else:
+ past_length = past_key_values[0][0].size(-2)
+ position_ids = torch.arange(past_length, hidden_states.size()[1] + past_length, dtype=torch.long, device=hidden_states.device)
+ position_ids = position_ids.unsqueeze(0).view(-1, hidden_states.size()[1])
+ position_embeds = self.pos_embedding(position_ids)
+ hidden_states = hidden_states + position_embeds
+
+ logits = self.lang_decoder(input_ids=None,
+ inputs_embeds=hidden_states,
+ attention_mask=attention_mask,
+ labels=None,
+ past_key_values=past_key_values,
+ use_cache=use_cache,
+ output_attentions=output_attentions,
+ output_hidden_states=output_hidden_states,
+ return_dict=return_dict).logits
+
+
+ logits_shift = logits[..., :-1, :].contiguous().view(-1, self.vocab_size) # remove the last token
+ labels_shift = labels[..., 1:].contiguous().to(logits_shift.device).view(-1) # remove the first token
+ # select index that is not -100
+ labels_index = labels_shift != -100
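+ # if the batch contains no labeled tokens, keep a small dummy slice so the loss and backward pass stay well defined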
+ if torch.sum(labels_index) == 0:
+ logits_shift = logits_shift[-2:,:].contiguous()
+ labels_shift = labels_shift[-2:].contiguous()
+ else:
+ logits_shift = logits_shift[labels_index,:].contiguous()
+ labels_shift = labels_shift[labels_index].contiguous()
+
+ loss_fct = CrossEntropyLoss()
+ loss = loss_fct(logits_shift, labels_shift)
+
+ return [loss,]
+
+ @torch.no_grad()
+ def generate(self, img, lang,
+ attention_mask=None,
+ input_labels=None,
+ generation_length=128,
+ generation_kwargs={}, # add some meaningful default values
+ ):
+ assert lang.size()[0] == 1, "only support batch size == 1 for now"
+ attention_mask = torch.ones_like(lang)
+ input_labels = torch.ones_like(lang)
+ # this part for now does not require gradient
+ img_feature = self.vis_encoder(img)
+ if not isinstance(img_feature, torch.Tensor):
+ img_feature = img_feature.last_hidden_state
+ img_proj = self.projection(img_feature)
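+ # do_generation=True skips the pad-to-multiple-of-8 rounding so the embedded prompt length is preserved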
+ hidden_states, attention_mask, input_labels = self.concat(img_proj, lang, attention_mask, input_labels, image_num=[img.size(0)], do_generation=True)
+
+ output = self.lang_decoder.generate(input_ids=None,
+ inputs_embeds=hidden_states,
+ attention_mask=attention_mask, # the mask distinguishes image tokens (2) from text tokens (1)
+ pad_token_id=self.tokenizer.pad_token_id,
+ max_new_tokens=generation_length, # this is the number of tokens you want to generate
+ **generation_kwargs)
+ return (output, self.tokenizer.batch_decode(output, skip_special_tokens=True)[0])
+
+
+ def gradient_checkpointing_enable(self):
+ self.vis_encoder.gradient_checkpointing_enable()
+ self.lang_decoder.gradient_checkpointing_enable()
\ No newline at end of file
diff --git a/applications/DeepSpeed-VisualChat/utils/model/third_party_model/hf_model/configuration_llama.py b/applications/DeepSpeed-VisualChat/utils/model/third_party_model/hf_model/configuration_llama.py
new file mode 100755
index 000000000..9b0f0ee69
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/model/third_party_model/hf_model/configuration_llama.py
@@ -0,0 +1,175 @@
+# coding=utf-8
+# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
+#
+# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+# and OPT implementations in this library. It has been modified from its
+# original forms to accommodate minor architectural differences compared
+# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" LLaMA model configuration"""
+
+from transformers.configuration_utils import PretrainedConfig
+from transformers.utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+LLAMA_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
+
+
+class LlamaConfig(PretrainedConfig):
+ r"""
+ This is the configuration class to store the configuration of a [`LlamaModel`]. It is used to instantiate an LLaMA
+ model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
+ defaults will yield a similar configuration to that of the LLaMA-7B.
+
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+ documentation from [`PretrainedConfig`] for more information.
+
+
+ Args:
+ vocab_size (`int`, *optional*, defaults to 32000):
+ Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the
+ `inputs_ids` passed when calling [`LlamaModel`]
+ hidden_size (`int`, *optional*, defaults to 4096):
+ Dimension of the hidden representations.
+ intermediate_size (`int`, *optional*, defaults to 11008):
+ Dimension of the MLP representations.
+ num_hidden_layers (`int`, *optional*, defaults to 32):
+ Number of hidden layers in the Transformer encoder.
+ num_attention_heads (`int`, *optional*, defaults to 32):
+ Number of attention heads for each attention layer in the Transformer encoder.
+ num_key_value_heads (`int`, *optional*):
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
+ `num_key_value_heads=1` the model will use Multi Query Attention (MQA); otherwise GQA is used. When
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
+ by meanpooling all the original heads within that group. For more details checkout [this
+ paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
+ `num_attention_heads`.
+ pretraining_tp (`int`, *optional*, defaults to `1`):
+ Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
+ document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is
+ necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
+ issue](https://github.com/pytorch/pytorch/issues/76232).
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
+ The non-linear activation function (function or string) in the decoder.
+ max_position_embeddings (`int`, *optional*, defaults to 2048):
+ The maximum sequence length that this model might ever be used with. Typically set this to something large
+ just in case (e.g., 512 or 1024 or 2048).
+ initializer_range (`float`, *optional*, defaults to 0.02):
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+ rms_norm_eps (`float`, *optional*, defaults to 1e-6):
+ The epsilon used by the rms normalization layers.
+ use_cache (`bool`, *optional*, defaults to `True`):
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
+ relevant if `config.is_decoder=True`.
+ tie_word_embeddings(`bool`, *optional*, defaults to `False`):
+ Whether to tie weight embeddings
+ rope_scaling (`Dict`, *optional*):
+ Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
+ strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format
+ is `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
+ `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
+ these scaling strategies behave:
+ https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
+ experimental feature, subject to breaking API changes in future versions.
+
+ Example:
+
+ ```python
+ >>> from transformers import LlamaModel, LlamaConfig
+
+ >>> # Initializing a LLaMA llama-7b style configuration
+ >>> configuration = LlamaConfig()
+
+ >>> # Initializing a model from the llama-7b style configuration
+ >>> model = LlamaModel(configuration)
+
+ >>> # Accessing the model configuration
+ >>> configuration = model.config
+ ```"""
+ model_type = "llama"
+ keys_to_ignore_at_inference = ["past_key_values"]
+
+ def __init__(
+ self,
+ vocab_size=32000,
+ hidden_size=4096,
+ intermediate_size=11008,
+ num_hidden_layers=32,
+ num_attention_heads=32,
+ num_key_value_heads=None,
+ hidden_act="silu",
+ max_position_embeddings=2048,
+ initializer_range=0.02,
+ rms_norm_eps=1e-6,
+ use_cache=True,
+ pad_token_id=None,
+ bos_token_id=1,
+ eos_token_id=2,
+ pretraining_tp=1,
+ tie_word_embeddings=False,
+ rope_scaling=None,
+ enable_mmca_attention=False,
+ **kwargs,
+ ):
+ self.vocab_size = vocab_size
+ self.max_position_embeddings = max_position_embeddings
+ self.hidden_size = hidden_size
+ self.intermediate_size = intermediate_size
+ self.num_hidden_layers = num_hidden_layers
+ self.num_attention_heads = num_attention_heads
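+ # DeepSpeed-VisualChat addition: toggles the multi-modal causal attention (MMCA) variant in the attention layers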
+ self.enable_mmca_attention = enable_mmca_attention
+ # for backward compatibility
+ if num_key_value_heads is None:
+ num_key_value_heads = num_attention_heads
+
+ self.num_key_value_heads = num_key_value_heads
+ self.hidden_act = hidden_act
+ self.initializer_range = initializer_range
+ self.rms_norm_eps = rms_norm_eps
+ self.pretraining_tp = pretraining_tp
+ self.use_cache = use_cache
+ self.rope_scaling = rope_scaling
+ self._rope_scaling_validation()
+
+ super().__init__(
+ pad_token_id=pad_token_id,
+ bos_token_id=bos_token_id,
+ eos_token_id=eos_token_id,
+ tie_word_embeddings=tie_word_embeddings,
+ **kwargs,
+ )
+
+ def _rope_scaling_validation(self):
+ """
+ Validate the `rope_scaling` configuration.
+ """
+ if self.rope_scaling is None:
+ return
+
+ if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
+ raise ValueError(
+ "`rope_scaling` must be a dictionary with two fields, `type` and `factor`, "
+ f"got {self.rope_scaling}"
+ )
+ rope_scaling_type = self.rope_scaling.get("type", None)
+ rope_scaling_factor = self.rope_scaling.get("factor", None)
+ if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]:
+ raise ValueError(
+ f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
+ )
+ if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
+ raise ValueError(f"`rope_scaling`'s factor field must be a float > 1, got {rope_scaling_factor}")
\ No newline at end of file
diff --git a/applications/DeepSpeed-VisualChat/utils/model/third_party_model/hf_model/modeling_llama.py b/applications/DeepSpeed-VisualChat/utils/model/third_party_model/hf_model/modeling_llama.py
new file mode 100755
index 000000000..b8e5c2ac6
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/model/third_party_model/hf_model/modeling_llama.py
@@ -0,0 +1,1096 @@
+# coding=utf-8
+# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
+#
+# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+# and OPT implementations in this library. It has been modified from its
+# original forms to accommodate minor architectural differences compared
+# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" PyTorch LLaMA model."""
+import math
+from typing import List, Optional, Tuple, Union
+
+import torch
+import torch.nn.functional as F
+import torch.utils.checkpoint
+from torch import nn
+from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
+
+from transformers.activations import ACT2FN
+from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast
+from transformers.modeling_utils import PreTrainedModel
+from transformers.utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings
+# from transformers.configuration_llama import LlamaConfig
+from transformers import LlamaConfig
+
+
+
+logger = logging.get_logger(__name__)
+
+_CONFIG_FOR_DOC = "LlamaConfig"
+
+
+# Copied from transformers.models.bart.modeling_bart._make_causal_mask
+def _make_causal_mask(
+ input_ids_shape: torch.Size, dtype: torch.dtype, device: torch.device, past_key_values_length: int = 0
+):
+ """
+ Make causal mask used for bi-directional self-attention.
+ """
+ bsz, tgt_len = input_ids_shape
+ mask = torch.full((tgt_len, tgt_len), torch.finfo(dtype).min, device=device)
+ mask_cond = torch.arange(mask.size(-1), device=device)
+ mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
+ mask = mask.to(dtype)
+
+ if past_key_values_length > 0:
+ mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype, device=device), mask], dim=-1)
+ return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length)
+
+
+# Copied from transformers.models.bart.modeling_bart._expand_mask
+def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None, enable_mmca_attention=False):
+ """
+ Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
+ """
+ # two mask-construction paths below: the standard causal/padding mask, and the MMCA pair of image/text masks
+ if enable_mmca_attention is False:
+ # basically, standard mask generation
+ mask = (mask > 0).to(mask.dtype) # our mask will have 0: padding, 1: text, and 2: image
+ bsz, src_len = mask.size()
+ tgt_len = tgt_len if tgt_len is not None else src_len
+
+ expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
+
+ inverted_mask = 1.0 - expanded_mask
+
+ return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min)
+ else:
+ bsz, src_len = mask.size()
+ tgt_len = tgt_len if tgt_len is not None else src_len
+ expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
+
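+ # MMCA builds two masks from the 0/1/2-valued attention_mask:
+ #   an image mask (text tokens may attend to image tokens, while image tokens attend only to themselves)
+ #   and a text mask (the usual causal attention among text tokens)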
+ # image mask
+ mask_img = mask.clone()
+ mask_img[mask_img!=2] = 0 # zero out every non-image position
+ mask_img[mask_img==2] = 1 # keep image positions as 1
+
+ expanded_mask_img = mask_img[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
+ # adding the identity to the diagonal of the image mask is not needed here
+ # expanded_mask_img = expanded_mask_img + torch.eye(mask.shape[-1], dtype=mask.dtype, device=mask.device)[None, None, :, :]
+ inverted_mask_img = 1.0 - expanded_mask_img
+ inverted_mask_img = inverted_mask_img.masked_fill(inverted_mask_img.to(torch.bool), torch.finfo(dtype).min)
+
+ # image tokens do not attend to other image tokens
+ if tgt_len == src_len:
+ # TODO: this covers the prompt (prefill) phase, where tgt_len == src_len; revisit for incremental decoding
+ for i in range(bsz):
+ for j in range(tgt_len):
+ if mask[i, j] == 2:
+ # an image token attends only to itself: mask out every other position
+ inverted_mask_img[i, :, j, :] = torch.finfo(dtype).min
+ inverted_mask_img[i, :, j, j] = 0
+
+
+ # text mask
+ mask_text = mask.clone()
+ mask_text[mask_text!=1] = 0 # zero out every non-text position
+ mask_text[mask_text==1] = 1 # keep text positions as 1
+ expanded_mask_text = mask_text[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
+ # the identity/diagonal term is likewise left disabled for the text mask
+ # expanded_mask_text = expanded_mask_text + torch.eye(mask.shape[-1], dtype=mask.dtype, device=mask.device)[None, None, :, :]
+ inverted_mask_text = 1.0 - expanded_mask_text
+ inverted_mask_text = inverted_mask_text.masked_fill(inverted_mask_text.to(torch.bool), torch.finfo(dtype).min)
+
+ return [inverted_mask_img, inverted_mask_text] # return two masks
+
+
+
+
+class LlamaRMSNorm(nn.Module):
+ def __init__(self, hidden_size, eps=1e-6):
+ """
+ LlamaRMSNorm is equivalent to T5LayerNorm
+ """
+ super().__init__()
+ self.weight = nn.Parameter(torch.ones(hidden_size))
+ self.variance_epsilon = eps
+
+ def forward(self, hidden_states):
+ input_dtype = hidden_states.dtype
+ hidden_states = hidden_states.to(torch.float32)
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+ return self.weight * hidden_states.to(input_dtype)
+
+
+class LlamaRotaryEmbedding(torch.nn.Module):
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
+ super().__init__()
+
+ self.dim = dim
+ self.max_position_embeddings = max_position_embeddings
+ self.base = base
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
+
+ # Build here to make `torch.jit.trace` work.
+ self._set_cos_sin_cache(
+ seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
+ )
+
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
+ self.max_seq_len_cached = seq_len
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
+
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
+ emb = torch.cat((freqs, freqs), dim=-1)
+ self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
+ self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)
+
+ def forward(self, x, seq_len=None):
+ # x: [bs, num_attention_heads, seq_len, head_size]
+ if seq_len > self.max_seq_len_cached:
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
+
+ return (
+ self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
+ self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
+ )
+
+
+class LlamaLinearScalingRotaryEmbedding(LlamaRotaryEmbedding):
+ """LlamaRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
+
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
+ self.scaling_factor = scaling_factor
+ super().__init__(dim, max_position_embeddings, base, device)
+
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
+ self.max_seq_len_cached = seq_len
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
+ t = t / self.scaling_factor
+
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
+ emb = torch.cat((freqs, freqs), dim=-1)
+ self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
+ self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)
+
+
+class LlamaDynamicNTKScalingRotaryEmbedding(LlamaRotaryEmbedding):
+ """LlamaRotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla"""
+
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
+ self.scaling_factor = scaling_factor
+ super().__init__(dim, max_position_embeddings, base, device)
+
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
+ self.max_seq_len_cached = seq_len
+
+ if seq_len > self.max_position_embeddings:
+ base = self.base * (
+ (self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)
+ ) ** (self.dim / (self.dim - 2))
+ inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
+
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
+
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
+ emb = torch.cat((freqs, freqs), dim=-1)
+ self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
+ self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)
+
+
+def rotate_half(x):
+ """Rotates half the hidden dims of the input."""
+ x1 = x[..., : x.shape[-1] // 2]
+ x2 = x[..., x.shape[-1] // 2 :]
+ return torch.cat((-x2, x1), dim=-1)
+
+
+def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
+ # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
+ cos = cos.squeeze(1).squeeze(0) # [seq_len, dim]
+ sin = sin.squeeze(1).squeeze(0) # [seq_len, dim]
+ cos = cos[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
+ sin = sin[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
+ q_embed = (q * cos) + (rotate_half(q) * sin)
+ k_embed = (k * cos) + (rotate_half(k) * sin)
+ return q_embed, k_embed
+
+
+class LlamaMLP(nn.Module):
+ def __init__(self, config):
+ super().__init__()
+ self.config = config
+ self.hidden_size = config.hidden_size
+ self.intermediate_size = config.intermediate_size
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
+ self.act_fn = ACT2FN[config.hidden_act]
+
+ def forward(self, x):
+ if self.config.pretraining_tp > 1:
+ slice = self.intermediate_size // self.config.pretraining_tp
+ gate_proj_slices = self.gate_proj.weight.split(slice, dim=0)
+ up_proj_slices = self.up_proj.weight.split(slice, dim=0)
+ down_proj_slices = self.down_proj.weight.split(slice, dim=1)
+
+ gate_proj = torch.cat(
+ [F.linear(x, gate_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1
+ )
+ up_proj = torch.cat([F.linear(x, up_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1)
+
+ intermediate_states = (self.act_fn(gate_proj) * up_proj).split(slice, dim=2)
+ down_proj = [
+ F.linear(intermediate_states[i], down_proj_slices[i]) for i in range(self.config.pretraining_tp)
+ ]
+ down_proj = sum(down_proj)
+ else:
+ down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
+
+ return down_proj
+
+
+def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+ """
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
+ """
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
+ if n_rep == 1:
+ return hidden_states
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
+
+
+class LlamaAttention(nn.Module):
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
+
+ def __init__(self, config: LlamaConfig):
+ super().__init__()
+ self.config = config
+ self.hidden_size = config.hidden_size
+ self.num_heads = config.num_attention_heads
+ self.head_dim = self.hidden_size // self.num_heads
+ self.num_key_value_heads = config.num_key_value_heads
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
+ self.max_position_embeddings = config.max_position_embeddings
+
+ if (self.head_dim * self.num_heads) != self.hidden_size:
+ raise ValueError(
+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
+ f" and `num_heads`: {self.num_heads})."
+ )
+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
+ self.enable_mmca_attention = config.enable_mmca_attention
+ self._init_rope()
+
+ def _init_rope(self):
+ if self.config.rope_scaling is None:
+ self.rotary_emb = LlamaRotaryEmbedding(self.head_dim, max_position_embeddings=self.max_position_embeddings)
+ else:
+ scaling_type = self.config.rope_scaling["type"]
+ scaling_factor = self.config.rope_scaling["factor"]
+ if scaling_type == "linear":
+ self.rotary_emb = LlamaLinearScalingRotaryEmbedding(
+ self.head_dim, max_position_embeddings=self.max_position_embeddings, scaling_factor=scaling_factor
+ )
+ elif scaling_type == "dynamic":
+ self.rotary_emb = LlamaDynamicNTKScalingRotaryEmbedding(
+ self.head_dim, max_position_embeddings=self.max_position_embeddings, scaling_factor=scaling_factor
+ )
+ else:
+ raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
+
+ def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
+ return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
+
+ def forward(
+ self,
+ hidden_states: torch.Tensor,
+ attention_mask: Optional[torch.Tensor] = None,
+ position_ids: Optional[torch.LongTensor] = None,
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
+ output_attentions: bool = False,
+ use_cache: bool = False,
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+ bsz, q_len, _ = hidden_states.size()
+
+ if self.config.pretraining_tp > 1:
+ key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.config.pretraining_tp
+ query_slices = self.q_proj.weight.split(
+ (self.num_heads * self.head_dim) // self.config.pretraining_tp, dim=0
+ )
+ key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
+ value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)
+
+ query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.config.pretraining_tp)]
+ query_states = torch.cat(query_states, dim=-1)
+
+ key_states = [F.linear(hidden_states, key_slices[i]) for i in range(self.config.pretraining_tp)]
+ key_states = torch.cat(key_states, dim=-1)
+
+ value_states = [F.linear(hidden_states, value_slices[i]) for i in range(self.config.pretraining_tp)]
+ value_states = torch.cat(value_states, dim=-1)
+
+ else:
+ query_states = self.q_proj(hidden_states)
+ key_states = self.k_proj(hidden_states)
+ value_states = self.v_proj(hidden_states)
+
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+
+ kv_seq_len = key_states.shape[-2]
+ if past_key_value is not None:
+ kv_seq_len += past_key_value[0].shape[-2]
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
+
+ if past_key_value is not None:
+ # reuse k, v, self_attention
+ key_states = torch.cat([past_key_value[0], key_states], dim=2)
+ value_states = torch.cat([past_key_value[1], value_states], dim=2)
+
+ past_key_value = (key_states, value_states) if use_cache else None
+
+ # repeat k/v heads if n_kv_heads < n_heads
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
+
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
+
+ if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
+ raise ValueError(
+ f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
+ f" {attn_weights.size()}"
+ )
+
+ if attention_mask is not None:
+ if self.enable_mmca_attention is False:
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
+ raise ValueError(
+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
+ )
+ else:
+ if attention_mask[0].size() != (bsz, 1, q_len, kv_seq_len):
+ raise ValueError(
+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
+ )
+ if self.enable_mmca_attention is False:
+ attn_weights = attn_weights + attention_mask
+ else:
+ attn_weights_img = attn_weights + attention_mask[0]
+ attn_weights_text = attn_weights + attention_mask[1]
+
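+ # with MMCA, attention over image keys and over text keys is normalized separately and the two distributions are then summed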
+ # upcast attention to fp32
+ if self.enable_mmca_attention is False:
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
+ else:
+ attn_weights_img = nn.functional.softmax(attn_weights_img, dim=-1, dtype=torch.float32).to(query_states.dtype)
+ attn_weights_text = nn.functional.softmax(attn_weights_text, dim=-1, dtype=torch.float32).to(query_states.dtype)
+ attn_weights = (attn_weights_img + attn_weights_text) #TODO: shall we reduce the weights of the diagonal part?
+
+ attn_output = torch.matmul(attn_weights, value_states)
+
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
+ raise ValueError(
+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
+ f" {attn_output.size()}"
+ )
+
+ attn_output = attn_output.transpose(1, 2).contiguous()
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
+
+ if self.config.pretraining_tp > 1:
+ attn_output = attn_output.split(self.hidden_size // self.config.pretraining_tp, dim=2)
+ o_proj_slices = self.o_proj.weight.split(self.hidden_size // self.config.pretraining_tp, dim=1)
+ attn_output = sum([F.linear(attn_output[i], o_proj_slices[i]) for i in range(self.config.pretraining_tp)])
+ else:
+ attn_output = self.o_proj(attn_output)
+
+ if not output_attentions:
+ attn_weights = None
+
+ return attn_output, attn_weights, past_key_value
+
+
+class LlamaDecoderLayer(nn.Module):
+ def __init__(self, config: LlamaConfig):
+ super().__init__()
+ self.hidden_size = config.hidden_size
+ self.self_attn = LlamaAttention(config=config)
+ self.mlp = LlamaMLP(config)
+ self.input_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+ self.post_attention_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+
+ def forward(
+ self,
+ hidden_states: torch.Tensor,
+ attention_mask: Optional[torch.Tensor] = None,
+ position_ids: Optional[torch.LongTensor] = None,
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
+ output_attentions: Optional[bool] = False,
+ use_cache: Optional[bool] = False,
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
+ """
+ Args:
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
+ attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
+ `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
+ output_attentions (`bool`, *optional*):
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
+ returned tensors for more detail.
+ use_cache (`bool`, *optional*):
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
+ (see `past_key_values`).
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
+ """
+
+ residual = hidden_states
+
+ hidden_states = self.input_layernorm(hidden_states)
+
+ # Self Attention
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
+ hidden_states=hidden_states,
+ attention_mask=attention_mask,
+ position_ids=position_ids,
+ past_key_value=past_key_value,
+ output_attentions=output_attentions,
+ use_cache=use_cache,
+ )
+ hidden_states = residual + hidden_states
+
+ # Fully Connected
+ residual = hidden_states
+ hidden_states = self.post_attention_layernorm(hidden_states)
+ hidden_states = self.mlp(hidden_states)
+ hidden_states = residual + hidden_states
+
+ outputs = (hidden_states,)
+
+ if output_attentions:
+ outputs += (self_attn_weights,)
+
+ if use_cache:
+ outputs += (present_key_value,)
+
+ return outputs
+
+
+LLAMA_START_DOCSTRING = r"""
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
+ etc.)
+
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
+ and behavior.
+
+ Parameters:
+ config ([`LlamaConfig`]):
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
+ load the weights associated with the model, only the configuration. Check out the
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+
+@add_start_docstrings(
+ "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
+ LLAMA_START_DOCSTRING,
+)
+class LlamaPreTrainedModel(PreTrainedModel):
+ config_class = LlamaConfig
+ base_model_prefix = "model"
+ supports_gradient_checkpointing = True
+ _no_split_modules = ["LlamaDecoderLayer"]
+ _skip_keys_device_placement = "past_key_values"
+
+ def _init_weights(self, module):
+ std = self.config.initializer_range
+ if isinstance(module, nn.Linear):
+ module.weight.data.normal_(mean=0.0, std=std)
+ if module.bias is not None:
+ module.bias.data.zero_()
+ elif isinstance(module, nn.Embedding):
+ module.weight.data.normal_(mean=0.0, std=std)
+ if module.padding_idx is not None:
+ module.weight.data[module.padding_idx].zero_()
+
+ def _set_gradient_checkpointing(self, module, value=False):
+ if isinstance(module, LlamaModel):
+ module.gradient_checkpointing = value
+
+
+LLAMA_INPUTS_DOCSTRING = r"""
+ Args:
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
+ it.
+
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+ [`PreTrainedTokenizer.__call__`] for details.
+
+ [What are input IDs?](../glossary#input-ids)
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
+
+ - 1 for tokens that are **not masked**,
+ - 0 for tokens that are **masked**.
+
+ [What are attention masks?](../glossary#attention-mask)
+
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+ [`PreTrainedTokenizer.__call__`] for details.
+
+ If `past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see
+ `past_key_values`).
+
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
+ information on the default strategy.
+
+ - 1 indicates the head is **not masked**,
+ - 0 indicates the head is **masked**.
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
+ config.n_positions - 1]`.
+
+ [What are position IDs?](../glossary#position-ids)
+ past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
+ Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
+ `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape
+ `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.
+
+ Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
+ blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
+
+ If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
+ don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
+ `decoder_input_ids` of shape `(batch_size, sequence_length)`.
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
+ model's internal embedding lookup matrix.
+ use_cache (`bool`, *optional*):
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
+ `past_key_values`).
+ output_attentions (`bool`, *optional*):
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
+ tensors for more detail.
+ output_hidden_states (`bool`, *optional*):
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+ more detail.
+ return_dict (`bool`, *optional*):
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+"""
+
+
+@add_start_docstrings(
+ "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
+ LLAMA_START_DOCSTRING,
+)
+class LlamaModel(LlamaPreTrainedModel):
+ """
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`LlamaDecoderLayer`]
+
+ Args:
+ config: LlamaConfig
+ """
+
+ def __init__(self, config: LlamaConfig):
+ super().__init__(config)
+ self.padding_idx = config.pad_token_id
+ self.vocab_size = config.vocab_size
+
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+ self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
+ self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+
+ self.gradient_checkpointing = False
+ self.enable_mmca_attention = config.enable_mmca_attention # DeepSpeed-VisualChat addition: enables multi-modal causal attention
+ # Initialize weights and apply final processing
+ self.post_init()
+
+ def get_input_embeddings(self):
+ return self.embed_tokens
+
+ def set_input_embeddings(self, value):
+ self.embed_tokens = value
+
+ # Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask
+ def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length):
+ # create causal mask
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
+ combined_attention_mask = None
+ if input_shape[-1] > 1:
+ combined_attention_mask = _make_causal_mask(
+ input_shape,
+ inputs_embeds.dtype,
+ device=inputs_embeds.device,
+ past_key_values_length=past_key_values_length,
+ )
+ if attention_mask is not None:
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
+ expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1], enable_mmca_attention=self.enable_mmca_attention)
+
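+ # with MMCA enabled, _expand_mask returns a pair [image_mask, text_mask]; the causal mask is added to each of them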
+ if self.enable_mmca_attention:
+ # if cross attention, we have two masks, this is from _expand_mask
+ expanded_attn_mask = [expanded_attn_mask[0].to(inputs_embeds.device), expanded_attn_mask[1].to(inputs_embeds.device)]
+ if combined_attention_mask is None:
+ combined_attention_mask = expanded_attn_mask
+ else:
+ combined_attention_mask = (expanded_attn_mask[0] + combined_attention_mask,
+ expanded_attn_mask[1] + combined_attention_mask)
+ else:
+ expanded_attn_mask = expanded_attn_mask.to(inputs_embeds.device)
+ combined_attention_mask = (
+ expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
+ )
+ return combined_attention_mask
+
+ @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
+ def forward(
+ self,
+ input_ids: torch.LongTensor = None,
+ attention_mask: Optional[torch.Tensor] = None,
+ position_ids: Optional[torch.LongTensor] = None,
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
+ inputs_embeds: Optional[torch.FloatTensor] = None,
+ use_cache: Optional[bool] = None,
+ output_attentions: Optional[bool] = None,
+ output_hidden_states: Optional[bool] = None,
+ return_dict: Optional[bool] = None,
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+ output_hidden_states = (
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+ )
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
+
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+ # retrieve input_ids and inputs_embeds
+ if input_ids is not None and inputs_embeds is not None:
+ raise ValueError("You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time")
+ elif input_ids is not None:
+ batch_size, seq_length = input_ids.shape
+ elif inputs_embeds is not None:
+ batch_size, seq_length, _ = inputs_embeds.shape
+ else:
+ raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")
+
+ seq_length_with_past = seq_length
+ past_key_values_length = 0
+
+ if past_key_values is not None:
+ past_key_values_length = past_key_values[0][0].shape[2]
+ seq_length_with_past = seq_length_with_past + past_key_values_length
+ if position_ids is None:
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
+ position_ids = torch.arange(
+ past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
+ )
+ position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
+ else:
+ position_ids = position_ids.view(-1, seq_length).long()
+
+ if inputs_embeds is None:
+ inputs_embeds = self.embed_tokens(input_ids)
+ # embed positions
+ if attention_mask is None:
+ attention_mask = torch.ones(
+ (batch_size, seq_length_with_past), dtype=torch.bool, device=inputs_embeds.device
+ )
+ attention_mask = self._prepare_decoder_attention_mask(
+ attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
+ )
+
+ hidden_states = inputs_embeds
+
+ if self.gradient_checkpointing and self.training:
+ if use_cache:
+ logger.warning_once(
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+ )
+ use_cache = False
+
+ # decoder layers
+ all_hidden_states = () if output_hidden_states else None
+ all_self_attns = () if output_attentions else None
+ next_decoder_cache = () if use_cache else None
+
+ for idx, decoder_layer in enumerate(self.layers):
+ if output_hidden_states:
+ all_hidden_states += (hidden_states,)
+
+ past_key_value = past_key_values[idx] if past_key_values is not None else None
+
+ if self.gradient_checkpointing and self.training:
+
+ def create_custom_forward(module):
+ def custom_forward(*inputs):
+ # None for past_key_value
+ return module(*inputs, output_attentions, None)
+
+ return custom_forward
+
+ layer_outputs = torch.utils.checkpoint.checkpoint(
+ create_custom_forward(decoder_layer),
+ hidden_states,
+ attention_mask,
+ position_ids,
+ None,
+ )
+ else:
+ layer_outputs = decoder_layer(
+ hidden_states,
+ attention_mask=attention_mask,
+ position_ids=position_ids,
+ past_key_value=past_key_value,
+ output_attentions=output_attentions,
+ use_cache=use_cache,
+ )
+
+ hidden_states = layer_outputs[0]
+
+ if use_cache:
+ next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)
+
+ if output_attentions:
+ all_self_attns += (layer_outputs[1],)
+
+ hidden_states = self.norm(hidden_states)
+
+ # add hidden states from the last decoder layer
+ if output_hidden_states:
+ all_hidden_states += (hidden_states,)
+
+ next_cache = next_decoder_cache if use_cache else None
+ if not return_dict:
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
+ return BaseModelOutputWithPast(
+ last_hidden_state=hidden_states,
+ past_key_values=next_cache,
+ hidden_states=all_hidden_states,
+ attentions=all_self_attns,
+ )
+
+
+class LlamaForCausalLM(LlamaPreTrainedModel):
+ _tied_weights_keys = ["lm_head.weight"]
+
+ def __init__(self, config):
+ super().__init__(config)
+ self.model = LlamaModel(config)
+ self.vocab_size = config.vocab_size
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+
+ # Initialize weights and apply final processing
+ self.post_init()
+
+ def get_input_embeddings(self):
+ return self.model.embed_tokens
+
+ def set_input_embeddings(self, value):
+ self.model.embed_tokens = value
+
+ def get_output_embeddings(self):
+ return self.lm_head
+
+ def set_output_embeddings(self, new_embeddings):
+ self.lm_head = new_embeddings
+
+ def set_decoder(self, decoder):
+ self.model = decoder
+
+ def get_decoder(self):
+ return self.model
+
+ @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
+ def forward(
+ self,
+ input_ids: torch.LongTensor = None,
+ attention_mask: Optional[torch.Tensor] = None,
+ position_ids: Optional[torch.LongTensor] = None,
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
+ inputs_embeds: Optional[torch.FloatTensor] = None,
+ labels: Optional[torch.LongTensor] = None,
+ use_cache: Optional[bool] = None,
+ output_attentions: Optional[bool] = None,
+ output_hidden_states: Optional[bool] = None,
+ return_dict: Optional[bool] = None,
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
+ r"""
+ Args:
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
+
+ Returns:
+
+ Example:
+
+ ```python
+ >>> from transformers import AutoTokenizer, LlamaForCausalLM
+
+ >>> model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
+
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
+
+ >>> # Generate
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
+ ```"""
+
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+ output_hidden_states = (
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+ )
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
+ outputs = self.model(
+ input_ids=input_ids,
+ attention_mask=attention_mask,
+ position_ids=position_ids,
+ past_key_values=past_key_values,
+ inputs_embeds=inputs_embeds,
+ use_cache=use_cache,
+ output_attentions=output_attentions,
+ output_hidden_states=output_hidden_states,
+ return_dict=return_dict,
+ )
+
+ hidden_states = outputs[0]
+ if self.config.pretraining_tp > 1:
+ lm_head_slices = self.lm_head.weight.split(self.vocab_size // self.config.pretraining_tp, dim=0)
+ logits = [F.linear(hidden_states, lm_head_slices[i]) for i in range(self.config.pretraining_tp)]
+ logits = torch.cat(logits, dim=-1)
+ else:
+ logits = self.lm_head(hidden_states)
+ logits = logits.float()
+
+ loss = None
+ if labels is not None:
+ # Shift so that tokens < n predict n
+ shift_logits = logits[..., :-1, :].contiguous()
+ shift_labels = labels[..., 1:].contiguous()
+ # Flatten the tokens
+ loss_fct = CrossEntropyLoss()
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
+ shift_labels = shift_labels.view(-1)
+ # Enable model parallelism
+ shift_labels = shift_labels.to(shift_logits.device)
+ loss = loss_fct(shift_logits, shift_labels)
+
+ if not return_dict:
+ output = (logits,) + outputs[1:]
+ return (loss,) + output if loss is not None else output
+
+ return CausalLMOutputWithPast(
+ loss=loss,
+ logits=logits,
+ past_key_values=outputs.past_key_values,
+ hidden_states=outputs.hidden_states,
+ attentions=outputs.attentions,
+ )
+
+ def prepare_inputs_for_generation(
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
+ ):
+ if past_key_values:
+ input_ids = input_ids[:, -1:]
+
+ position_ids = kwargs.get("position_ids", None)
+ if attention_mask is not None and position_ids is None:
+ # create position_ids on the fly for batch generation
+ # TODO: deriving position_ids from the (possibly 0/1/2-valued) attention mask is a workaround, but it works for now
+ attention_mask_clone = attention_mask.clone() > 0
+ position_ids = attention_mask_clone.long().cumsum(-1) - 1
+ position_ids.masked_fill_(attention_mask_clone == 0, 1)
+ if past_key_values:
+ position_ids = position_ids[:, -1].unsqueeze(-1)
+
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
+ if inputs_embeds is not None and past_key_values is None:
+ model_inputs = {"inputs_embeds": inputs_embeds}
+ else:
+ model_inputs = {"input_ids": input_ids}
+
+ model_inputs.update(
+ {
+ "position_ids": position_ids,
+ "past_key_values": past_key_values,
+ "use_cache": kwargs.get("use_cache"),
+ "attention_mask": attention_mask,
+ }
+ )
+ return model_inputs
+
+ @staticmethod
+ def _reorder_cache(past_key_values, beam_idx):
+ reordered_past = ()
+ for layer_past in past_key_values:
+ reordered_past += (
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
+ )
+ return reordered_past
+
+
+@add_start_docstrings(
+ """
+ The LLaMa Model transformer with a sequence classification head on top (linear layer).
+
+ [`LlamaForSequenceClassification`] uses the last token in order to do the classification, as other causal models
+ (e.g. GPT-2) do.
+
+ Since it does classification on the last token, it requires to know the position of the last token. If a
+ `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
+ no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
+ padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
+ each row of the batch).
+ """,
+ LLAMA_START_DOCSTRING,
+)
+class LlamaForSequenceClassification(LlamaPreTrainedModel):
+ def __init__(self, config):
+ super().__init__(config)
+ self.num_labels = config.num_labels
+ self.model = LlamaModel(config)
+ self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
+
+ # Initialize weights and apply final processing
+ self.post_init()
+
+ def get_input_embeddings(self):
+ return self.model.embed_tokens
+
+ def set_input_embeddings(self, value):
+ self.model.embed_tokens = value
+
+ @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
+ def forward(
+ self,
+ input_ids: torch.LongTensor = None,
+ attention_mask: Optional[torch.Tensor] = None,
+ position_ids: Optional[torch.LongTensor] = None,
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
+ inputs_embeds: Optional[torch.FloatTensor] = None,
+ labels: Optional[torch.LongTensor] = None,
+ use_cache: Optional[bool] = None,
+ output_attentions: Optional[bool] = None,
+ output_hidden_states: Optional[bool] = None,
+ return_dict: Optional[bool] = None,
+ ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
+ r"""
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
+ """
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+ transformer_outputs = self.model(
+ input_ids,
+ attention_mask=attention_mask,
+ position_ids=position_ids,
+ past_key_values=past_key_values,
+ inputs_embeds=inputs_embeds,
+ use_cache=use_cache,
+ output_attentions=output_attentions,
+ output_hidden_states=output_hidden_states,
+ return_dict=return_dict,
+ )
+ hidden_states = transformer_outputs[0]
+ logits = self.score(hidden_states)
+
+ if input_ids is not None:
+ batch_size = input_ids.shape[0]
+ else:
+ batch_size = inputs_embeds.shape[0]
+
+ if self.config.pad_token_id is None and batch_size != 1:
+ raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
+ if self.config.pad_token_id is None:
+ sequence_lengths = -1
+ else:
+ if input_ids is not None:
+ sequence_lengths = (torch.eq(input_ids, self.config.pad_token_id).long().argmax(-1) - 1).to(
+ logits.device
+ )
+ else:
+ sequence_lengths = -1
+
+ pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
+
+ loss = None
+ if labels is not None:
+ labels = labels.to(logits.device)
+ if self.config.problem_type is None:
+ if self.num_labels == 1:
+ self.config.problem_type = "regression"
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
+ self.config.problem_type = "single_label_classification"
+ else:
+ self.config.problem_type = "multi_label_classification"
+
+ if self.config.problem_type == "regression":
+ loss_fct = MSELoss()
+ if self.num_labels == 1:
+ loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
+ else:
+ loss = loss_fct(pooled_logits, labels)
+ elif self.config.problem_type == "single_label_classification":
+ loss_fct = CrossEntropyLoss()
+ loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
+ elif self.config.problem_type == "multi_label_classification":
+ loss_fct = BCEWithLogitsLoss()
+ loss = loss_fct(pooled_logits, labels)
+ if not return_dict:
+ output = (pooled_logits,) + transformer_outputs[1:]
+ return ((loss,) + output) if loss is not None else output
+
+ return SequenceClassifierOutputWithPast(
+ loss=loss,
+ logits=pooled_logits,
+ past_key_values=transformer_outputs.past_key_values,
+ hidden_states=transformer_outputs.hidden_states,
+ attentions=transformer_outputs.attentions,
+ )
diff --git a/applications/DeepSpeed-VisualChat/utils/model/third_party_model/qwen_clip/qwen_clip.py b/applications/DeepSpeed-VisualChat/utils/model/third_party_model/qwen_clip/qwen_clip.py
new file mode 100644
index 000000000..4bb87426d
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/model/third_party_model/qwen_clip/qwen_clip.py
@@ -0,0 +1,424 @@
+# Copyright (c) Alibaba Cloud.
+#
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+from collections import OrderedDict
+import math
+import requests
+from io import BytesIO
+from functools import partial
+from PIL import Image
+from typing import Callable, Optional, Sequence, Tuple, List
+import numpy as np
+
+import torch
+from torch import nn
+from torch.nn import functional as F
+from torch.nn.init import trunc_normal_
+from torchvision import transforms
+from torchvision.transforms import InterpolationMode
+
+
+def get_abs_pos(abs_pos, tgt_size):
+ # abs_pos: L, C
+ # tgt_size: M
+ # return: M, C
+ src_size = int(math.sqrt(abs_pos.size(0)))
+ tgt_size = int(math.sqrt(tgt_size))
+ dtype = abs_pos.dtype
+
+ if src_size != tgt_size:
+ return F.interpolate(
+ abs_pos.float().reshape(1, src_size, src_size, -1).permute(0, 3, 1, 2),
+ size=(tgt_size, tgt_size),
+ mode="bicubic",
+ align_corners=False,
+ ).permute(0, 2, 3, 1).flatten(0, 2).to(dtype=dtype)
+ else:
+ return abs_pos
+
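+# Hedged usage sketch (not part of the original QWen code; sizes are hypothetical):
+# interpolating a 16x16 grid of positional embeddings to a 32x32 grid.
+def _demo_get_abs_pos():
+ abs_pos = torch.randn(16 * 16, 1664) # (L, C) positional embedding table
+ resized = get_abs_pos(abs_pos, 32 * 32) # bicubic-resized to (M, C)
+ return resized.shape # torch.Size([1024, 1664])
+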
+# https://github.com/facebookresearch/mae/blob/efb2a8062c206524e35e47d04501ed4f544c0ae8/util/pos_embed.py#L20
+def get_2d_sincos_pos_embed(embed_dim, grid_size, cls_token=False):
+ """
+ grid_size: int of the grid height and width
+ return:
+ pos_embed: [grid_size*grid_size, embed_dim] or [1+grid_size*grid_size, embed_dim] (w/ or w/o cls_token)
+ """
+ grid_h = np.arange(grid_size, dtype=np.float32)
+ grid_w = np.arange(grid_size, dtype=np.float32)
+ grid = np.meshgrid(grid_w, grid_h) # here w goes first
+ grid = np.stack(grid, axis=0)
+
+ grid = grid.reshape([2, 1, grid_size, grid_size])
+ pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid)
+ if cls_token:
+ pos_embed = np.concatenate([np.zeros([1, embed_dim]), pos_embed], axis=0)
+ return pos_embed
+
+
+def get_2d_sincos_pos_embed_from_grid(embed_dim, grid):
+ assert embed_dim % 2 == 0
+
+ # use half of dimensions to encode grid_h
+ emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[0]) # (H*W, D/2)
+ emb_w = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[1]) # (H*W, D/2)
+
+ emb = np.concatenate([emb_h, emb_w], axis=1) # (H*W, D)
+ return emb
+
+
+def get_1d_sincos_pos_embed_from_grid(embed_dim, pos):
+ """
+ embed_dim: output dimension for each position
+ pos: a list of positions to be encoded: size (M,)
+ out: (M, D)
+ """
+ assert embed_dim % 2 == 0
+ omega = np.arange(embed_dim // 2, dtype=np.float32)
+ omega /= embed_dim / 2.
+ omega = 1. / 10000**omega # (D/2,)
+
+ pos = pos.reshape(-1) # (M,)
+ out = np.einsum('m,d->md', pos, omega) # (M, D/2), outer product
+
+ emb_sin = np.sin(out) # (M, D/2)
+ emb_cos = np.cos(out) # (M, D/2)
+
+ emb = np.concatenate([emb_sin, emb_cos], axis=1) # (M, D)
+ return emb
+
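+# Hedged shape check (illustrative only): a 16x16 grid of sin/cos positional
+# embeddings for a hypothetical embedding width of 4096.
+def _demo_sincos_pos_embed():
+ pos_embed = get_2d_sincos_pos_embed(embed_dim=4096, grid_size=16)
+ return pos_embed.shape # (256, 4096)
+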
+
+class Resampler(nn.Module):
+ """
+ A 2D perceiver-resampler network with a single cross-attention layer, using
+ (grid_size**2) learnable queries and 2D sin/cos positional embeddings.
+ Outputs:
+ A tensor of shape (grid_size**2, embed_dim) per batch element
+ """
+ def __init__(
+ self,
+ grid_size,
+ embed_dim,
+ num_heads,
+ kv_dim=None,
+ norm_layer=nn.LayerNorm
+ ):
+ super().__init__()
+ self.num_queries = grid_size ** 2
+ self.embed_dim = embed_dim
+ self.num_heads = num_heads
+
+ self.pos_embed = nn.Parameter(
+ torch.from_numpy(get_2d_sincos_pos_embed(embed_dim, grid_size)).float()
+ ).requires_grad_(False)
+
+ self.query = nn.Parameter(torch.zeros(self.num_queries, embed_dim))
+ trunc_normal_(self.query, std=.02)
+
+ if kv_dim is not None and kv_dim != embed_dim:
+ self.kv_proj = nn.Linear(kv_dim, embed_dim, bias=False)
+ else:
+ self.kv_proj = nn.Identity()
+
+ self.attn = nn.MultiheadAttention(embed_dim, num_heads)
+ self.ln_q = norm_layer(embed_dim)
+ self.ln_kv = norm_layer(embed_dim)
+
+ self.apply(self._init_weights)
+
+ def _init_weights(self, m):
+ if isinstance(m, nn.Linear):
+ trunc_normal_(m.weight, std=.02)
+ if isinstance(m, nn.Linear) and m.bias is not None:
+ nn.init.constant_(m.bias, 0)
+ elif isinstance(m, nn.LayerNorm):
+ nn.init.constant_(m.bias, 0)
+ nn.init.constant_(m.weight, 1.0)
+
+ def forward(self, x, attn_mask=None):
+
+ pos_embed = get_abs_pos(self.pos_embed, x.size(1))
+
+ x = self.kv_proj(x)
+ x = self.ln_kv(x).permute(1, 0, 2)
+
+ N = x.shape[1]
+ q = self.ln_q(self.query)
+ out = self.attn(
+ self._repeat(q, N) + self.pos_embed.unsqueeze(1),
+ x + pos_embed.unsqueeze(1),
+ x,
+ attn_mask=attn_mask)[0]
+ return out.permute(1, 0, 2)
+
+ def _repeat(self, query, N: int):
+ return query.unsqueeze(1).repeat(1, N, 1)
+
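+# Hedged usage sketch (illustrative; sizes are hypothetical): the resampler
+# compresses a variable-length sequence of visual features into grid_size**2
+# query tokens of width embed_dim.
+def _demo_resampler():
+ resampler = Resampler(grid_size=16, embed_dim=4096, num_heads=32, kv_dim=1664)
+ vit_features = torch.randn(2, 1024, 1664) # (batch, patches, kv_dim)
+ queries = resampler(vit_features) # (batch, 256, 4096)
+ return queries.shape
+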
+
+class VisualAttention(nn.Module):
+ """self-attention layer class.
+ Self-attention layer takes input with size [s, b, h]
+ and returns output of the same size.
+ """
+
+ def __init__(self, embed_dim, num_heads,
+ bias=True, kdim=None, vdim=None):
+ super(VisualAttention, self).__init__()
+ self.embed_dim = embed_dim
+ self.kdim = kdim if kdim is not None else embed_dim
+ self.vdim = vdim if vdim is not None else embed_dim
+ self._qkv_same_embed_dim = self.kdim == embed_dim and self.vdim == embed_dim
+
+ self.num_heads = num_heads
+
+ # Per attention head and per partition values.
+ assert embed_dim % num_heads == 0
+ self.hidden_size_per_attention_head = embed_dim // num_heads
+ self.num_attention_heads_per_partition = num_heads
+ self.hidden_size_per_partition = embed_dim
+
+ # Strided linear layer.
+ assert self._qkv_same_embed_dim, 'Only Support SelfAttention Currently'
+ self.in_proj = nn.Linear(embed_dim, 3 * embed_dim)
+ self.out_proj = nn.Linear(embed_dim, embed_dim)
+ self.norm_factor = math.sqrt(self.hidden_size_per_attention_head)
+
+ def forward(self, query, key, value, attn_mask = None):
+ # query/key/value: [sq, b, h]
+ sq, b, _ = query.size()
+ # print("Diff", (query-key).norm())
+ # assert query is key, 'Only Support Self-Attention Currently'
+ sk = sq
+ mixed_x_layer = self.in_proj(query)
+
+ # [sq, b, (np * 3 * hn)] --> [sq, b, np, 3 * hn]
+ new_tensor_shape = mixed_x_layer.size()[:-1] + \
+ (self.num_attention_heads_per_partition,
+ 3 * self.hidden_size_per_attention_head)
+ mixed_x_layer = mixed_x_layer.view(*new_tensor_shape)
+
+ # [sq, b, np, 3 * hn] --> 3 [sq, b, np, hn]
+ query_layer, key_layer, value_layer = mixed_x_layer.split(
+ self.hidden_size_per_attention_head, dim=-1)
+
+ # [sq, b, np, hn] -> [sq, b * np, hn]
+ query_layer = query_layer.view(sq,
+ b * self.num_attention_heads_per_partition,
+ self.hidden_size_per_attention_head).transpose(0, 1)
+ # [sk, b, np, hn] -> [sk, b * np, hn]
+ key_layer = key_layer.view(sk,
+ b * self.num_attention_heads_per_partition,
+ self.hidden_size_per_attention_head).transpose(0, 1)
+
+ q_scaled = query_layer / self.norm_factor
+ if attn_mask is not None:
+ attention_probs = torch.baddbmm(attn_mask, q_scaled, key_layer.transpose(-2, -1))
+ else:
+ attention_probs = torch.bmm(q_scaled, key_layer.transpose(-2, -1))
+ attention_probs = attention_probs.softmax(dim=-1)
+
+ value_layer = value_layer.view(sk,
+ b * self.num_attention_heads_per_partition,
+ self.hidden_size_per_attention_head).transpose(0, 1)
+
+ # matmul: [b * np, sq, hn]
+ context_layer = torch.bmm(attention_probs, value_layer)
+
+ # change view [b, np, sq, hn]
+ context_layer = context_layer.view(b,
+ self.num_attention_heads_per_partition,
+ sq, self.hidden_size_per_attention_head)
+
+ # [b, np, sq, hn] --> [sq, b, np, hn]
+ context_layer = context_layer.permute(2, 0, 1, 3).contiguous()
+
+ # [sq, b, np, hn] --> [sq, b, hp]
+ new_context_layer_shape = context_layer.size()[:-2] + \
+ (self.hidden_size_per_partition,)
+ context_layer = context_layer.view(*new_context_layer_shape)
+
+ output = self.out_proj(context_layer)
+
+ return output
+
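+# Hedged shape note (illustrative only): VisualAttention is plain self-attention
+# over sequence-first inputs, i.e. [s, b, h] in and [s, b, h] out.
+def _demo_visual_attention():
+ attn = VisualAttention(embed_dim=1664, num_heads=16)
+ x = torch.randn(256, 2, 1664) # (seq, batch, hidden)
+ return attn(x, x, x).shape # torch.Size([256, 2, 1664])
+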
+
+class VisualAttentionBlock(nn.Module):
+ def __init__(
+ self,
+ d_model: int,
+ n_head: int,
+ mlp_size: int,
+ act_layer: Callable = nn.GELU,
+ norm_layer: Callable = nn.LayerNorm,
+ is_cross_attention: bool = False,
+ ):
+ super().__init__()
+
+ self.ln_1 = norm_layer(d_model)
+ if is_cross_attention:
+ self.ln_1_kv = norm_layer(d_model)
+
+ self.ln_2 = norm_layer(d_model)
+ mlp_width = int(mlp_size)
+ self.attn = VisualAttention(d_model, n_head)
+ self.mlp = nn.Sequential(OrderedDict([
+ ("c_fc", nn.Linear(d_model, mlp_width)),
+ ("gelu", act_layer()),
+ ("c_proj", nn.Linear(mlp_width, d_model))
+ ]))
+
+ def attention(
+ self,
+ q_x: torch.Tensor,
+ k_x: Optional[torch.Tensor] = None,
+ v_x: Optional[torch.Tensor] = None,
+ attn_mask: Optional[torch.Tensor] = None,
+ ):
+ k_x = k_x if k_x is not None else q_x
+ v_x = v_x if v_x is not None else q_x
+
+ attn_mask = attn_mask.to(q_x.dtype) if attn_mask is not None else None
+ return self.attn(q_x, k_x, v_x, attn_mask=attn_mask)
+
+ def forward(
+ self,
+ q_x: torch.Tensor,
+ k_x: Optional[torch.Tensor] = None,
+ v_x: Optional[torch.Tensor] = None,
+ attn_mask: Optional[torch.Tensor] = None,
+ ):
+ k_x = self.ln_1_kv(k_x) if hasattr(self, "ln_1_kv") and k_x is not None else None
+ v_x = self.ln_1_kv(v_x) if hasattr(self, "ln_1_kv") and v_x is not None else None
+
+ x = q_x + self.attention(q_x=self.ln_1(q_x), k_x=k_x, v_x=v_x, attn_mask=attn_mask)
+ x = x + self.mlp(self.ln_2(x))
+ return x
+
+
+class TransformerBlock(nn.Module):
+ def __init__(
+ self,
+ width: int,
+ layers: int,
+ heads: int,
+ mlp_size: int,
+ act_layer: Callable = nn.GELU,
+ norm_layer: Callable = nn.LayerNorm,
+ ):
+ super().__init__()
+ self.width = width
+ self.layers = layers
+
+ self.resblocks = nn.ModuleList([
+ VisualAttentionBlock(
+ width, heads, mlp_size, act_layer=act_layer, norm_layer=norm_layer)
+ for _ in range(layers)
+ ])
+
+ self.gradient_checkpointing = False
+
+ def enable_gradient_checkpointing(self):
+ self.gradient_checkpointing = True
+
+ def disable_gradient_checkpointing(self):
+ self.gradient_checkpointing = False
+
+ def get_cast_dtype(self) -> torch.dtype:
+ return self.resblocks[0].mlp.c_fc.weight.dtype
+
+ def get_cast_device(self) -> torch.device:
+ return self.resblocks[0].mlp.c_fc.weight.device
+
+ def forward(self, x: torch.Tensor, attn_mask: Optional[torch.Tensor] = None):
+ for r in self.resblocks:
+ if self.gradient_checkpointing and self.training:
+ def create_custom_forward(module):
+ def custom_forward(*inputs):
+ return module(*inputs)
+ return custom_forward
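+ # Note: attn_mask is not forwarded through the checkpointed call; the
+ # vision tower in this file always invokes the transformer without a mask.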
+ x = torch.utils.checkpoint.checkpoint(create_custom_forward(r), x)
+ else:
+ x = r(x, attn_mask=attn_mask)
+ return x
+
+
+class VisionTransformer(nn.Module):
+
+ def __init__(
+ self,
+ image_size: int,
+ patch_size: int,
+ width: int,
+ layers: int,
+ heads: int,
+ mlp_size: int,
+ output_dim: int,
+ n_queries: int = 256,
+ **kwargs
+ ):
+ super().__init__()
+ image_height, image_width = self.image_size = (image_size, image_size)
+ patch_height, patch_width = self.patch_size = (patch_size, patch_size)
+ self.grid_size = (image_height // patch_height, image_width // patch_width)
+ self.output_dim = output_dim
+
+ self.conv1 = nn.Conv2d(in_channels=3, out_channels=width, kernel_size=patch_size, stride=patch_size, bias=False)
+
+ # class embeddings and positional embeddings
+ scale = width ** -0.5
+ self.positional_embedding = nn.Parameter(scale * torch.randn(256, width))
+
+ norm_layer = partial(nn.LayerNorm, eps=1e-6)
+ act_layer = nn.GELU
+
+ self.ln_pre = norm_layer(width)
+ self.transformer = TransformerBlock(
+ width,
+ layers,
+ heads,
+ mlp_size,
+ act_layer=act_layer,
+ norm_layer=norm_layer,
+ )
+
+ self.attn_pool = Resampler(
+ grid_size=int(math.sqrt(n_queries)),
+ embed_dim=output_dim,
+ num_heads=output_dim // 128,
+ kv_dim=width,
+ norm_layer=norm_layer,
+ )
+ self.ln_post = norm_layer(output_dim)
+
+ def gradient_checkpointing_enable(self):
+ self.transformer.enable_gradient_checkpointing()
+
+ def gradient_checkpointing_disable(self):
+ self.transformer.disable_gradient_checkpointing()
+
+ def forward(self, x: torch.Tensor):
+ x = x.to(
+ dtype=self.transformer.get_cast_dtype(),
+ device=self.transformer.get_cast_device(),
+ )
+ # to patches
+ x = self.conv1(x) # shape = [*, width, grid, grid]
+ x = x.reshape(x.shape[0], x.shape[1], -1) # shape = [*, width, grid ** 2]
+ x = x.permute(0, 2, 1) # shape = [*, grid ** 2, width]
+
+ x = x + get_abs_pos(self.positional_embedding, x.size(1))
+
+ x = self.ln_pre(x)
+
+ x = x.permute(1, 0, 2) # NLD -> LND
+ x = self.transformer(x)
+ x = x.permute(1, 0, 2) # LND -> NLD
+
+ x = self.attn_pool(x)
+ x = self.ln_post(x)
+
+ return x
\ No newline at end of file
diff --git a/applications/DeepSpeed-VisualChat/utils/model/vis_proj.py b/applications/DeepSpeed-VisualChat/utils/model/vis_proj.py
new file mode 100644
index 000000000..4c5ac875f
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/model/vis_proj.py
@@ -0,0 +1,153 @@
+import torch
+import torch.nn.functional as F
+from transformers.models.clip.modeling_clip import CLIPEncoderLayer
+from torch import nn
+import os
+import sys
+import math
+sys.path.append(
+ os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir)))
+import numpy as np
+from torch.nn.init import trunc_normal_
+
+
+class VisProjection_vit(nn.Module):
+ def __init__(self, vis_config, lang_dim):
+ super().__init__()
+ # TODO: for now, hard-coded for ViT
+ self.vis_layer = CLIPEncoderLayer(vis_config)
+ self.projection = nn.Sequential(
+ nn.Linear(vis_config.hidden_size, lang_dim), # an example implementation
+ nn.LayerNorm(lang_dim, eps=1e-12))
+
+ def forward(self, vis_input):
+ vis_feature = self.vis_layer(vis_input, None, None)[0] # only need the first output
+ return self.projection(vis_feature)
+
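+# Hedged usage sketch (illustrative; the config and sizes are hypothetical): the
+# projection runs one extra CLIP encoder layer and then maps its hidden size to
+# the language model's hidden size.
+def _demo_vis_projection_vit():
+ from transformers import CLIPVisionConfig
+ vis_config = CLIPVisionConfig() # hidden_size defaults to 768
+ proj = VisProjection_vit(vis_config, lang_dim=4096)
+ patch_features = torch.randn(2, 257, vis_config.hidden_size)
+ return proj(patch_features).shape # torch.Size([2, 257, 4096])
+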
+
+# The following code is adapted from QWen-CLIP
+def get_abs_pos(abs_pos, tgt_size):
+ # abs_pos: L, C
+ # tgt_size: M
+ # return: M, C
+ src_size = int(math.sqrt(abs_pos.size(0)))
+ tgt_size = int(math.sqrt(tgt_size))
+ dtype = abs_pos.dtype
+
+ if src_size != tgt_size:
+ return F.interpolate(
+ abs_pos.float().reshape(1, src_size, src_size, -1).permute(0, 3, 1, 2),
+ size=(tgt_size, tgt_size),
+ mode="bicubic",
+ align_corners=False,
+ ).permute(0, 2, 3, 1).flatten(0, 2).to(dtype=dtype)
+ else:
+ return abs_pos
+
+# https://github.com/facebookresearch/mae/blob/efb2a8062c206524e35e47d04501ed4f544c0ae8/util/pos_embed.py#L20
+def get_2d_sincos_pos_embed(embed_dim, grid_size, cls_token=False):
+ """
+ grid_size: int of the grid height and width
+ return:
+ pos_embed: [grid_size*grid_size, embed_dim] or [1+grid_size*grid_size, embed_dim] (w/ or w/o cls_token)
+ """
+ grid_h = np.arange(grid_size, dtype=np.float32)
+ grid_w = np.arange(grid_size, dtype=np.float32)
+ grid = np.meshgrid(grid_w, grid_h) # here w goes first
+ grid = np.stack(grid, axis=0)
+
+ grid = grid.reshape([2, 1, grid_size, grid_size])
+ pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid)
+ if cls_token:
+ pos_embed = np.concatenate([np.zeros([1, embed_dim]), pos_embed], axis=0)
+ return pos_embed
+
+def get_2d_sincos_pos_embed_from_grid(embed_dim, grid):
+ assert embed_dim % 2 == 0
+
+ # use half of dimensions to encode grid_h
+ emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[0]) # (H*W, D/2)
+ emb_w = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[1]) # (H*W, D/2)
+
+ emb = np.concatenate([emb_h, emb_w], axis=1) # (H*W, D)
+ return emb
+
+
+def get_1d_sincos_pos_embed_from_grid(embed_dim, pos):
+ """
+ embed_dim: output dimension for each position
+ pos: a list of positions to be encoded: size (M,)
+ out: (M, D)
+ """
+ assert embed_dim % 2 == 0
+ omega = np.arange(embed_dim // 2, dtype=np.float32)
+ omega /= embed_dim / 2.
+ omega = 1. / 10000**omega # (D/2,)
+
+ pos = pos.reshape(-1) # (M,)
+ out = np.einsum('m,d->md', pos, omega) # (M, D/2), outer product
+
+ emb_sin = np.sin(out) # (M, D/2)
+ emb_cos = np.cos(out) # (M, D/2)
+
+ emb = np.concatenate([emb_sin, emb_cos], axis=1) # (M, D)
+ return emb
+
+
+class VisProjection_perceiver(nn.Module):
+ def __init__(self, vis_config, lang_dim):
+ super().__init__()
+ # TODO: for now, hard-coded for perceiver
+ grid_size = 16
+ self.num_queries = grid_size ** 2
+ self.embed_dim = lang_dim
+ self.num_heads = lang_dim // 128
+
+ self.pos_embed = nn.Parameter(
+ torch.from_numpy(get_2d_sincos_pos_embed(lang_dim, grid_size)).float()
+ ).requires_grad_(False)
+
+ self.query = nn.Parameter(torch.zeros(self.num_queries, lang_dim))
+ trunc_normal_(self.query, std=.02)
+
+ self.kv_proj = nn.Linear(vis_config.hidden_size, lang_dim)
+
+ self.attn = nn.MultiheadAttention(lang_dim, self.num_heads)
+ self.ln_q = nn.LayerNorm(lang_dim, eps=1e-12)
+ self.ln_kv = nn.LayerNorm(lang_dim, eps=1e-12)
+ self.projection = nn.Sequential(
+ nn.LayerNorm(lang_dim, eps=1e-12),
+ nn.Linear(lang_dim, lang_dim) # an example implementation
+ )
+
+ self.apply(self._init_weights)
+
+ def _init_weights(self, m):
+ if isinstance(m, nn.Linear):
+ trunc_normal_(m.weight, std=.02)
+ if isinstance(m, nn.Linear) and m.bias is not None:
+ nn.init.constant_(m.bias, 0)
+ elif isinstance(m, nn.LayerNorm):
+ nn.init.constant_(m.bias, 0)
+ nn.init.constant_(m.weight, 1.0)
+
+ def forward(self, x, attn_mask=None):
+ pos_embed = get_abs_pos(self.pos_embed, x.size(1))
+
+ x = x[:, 1:, :] # remove cls token
+ x = self.kv_proj(x)
+ x = self.ln_kv(x).permute(1, 0, 2)
+
+ N = x.shape[1]
+ q = self.ln_q(self.query)
+ out = self.attn(
+ self._repeat(q, N) + self.pos_embed.unsqueeze(1),
+ x + pos_embed.unsqueeze(1),
+ x,
+ attn_mask=attn_mask)[0]
+ return self.projection(out.permute(1, 0, 2))
+
+ def _repeat(self, query, N: int):
+ return query.unsqueeze(1).repeat(1, N, 1)
\ No newline at end of file
diff --git a/applications/DeepSpeed-VisualChat/utils/module/lora.py b/applications/DeepSpeed-VisualChat/utils/module/lora.py
new file mode 100644
index 000000000..67e446033
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/module/lora.py
@@ -0,0 +1,147 @@
+# Copyright (c) Microsoft Corporation.
+# SPDX-License-Identifier: Apache-2.0
+
+# DeepSpeed Team
+import math
+import torch
+from torch import nn
+import torch.nn.functional as F
+from deepspeed.compression.helper import recursive_getattr, recursive_setattr
+import deepspeed
+
+
+class LinearLayer_LoRA(nn.Module):
+ # a simple implementation of LoRA
+ # for now, only nn.Linear layers are supported
+ def __init__(self,
+ weight,
+ lora_dim=0,
+ lora_scaling=1,
+ lora_droppout=0,
+ bias=None):
+ super(LinearLayer_LoRA, self).__init__()
+ self.weight = weight
+ self.bias = bias
+
+ if lora_dim <= 0:
+ raise ValueError(
+ "You are training to use LoRA, whose reduced dim should be larger than 1"
+ )
+
+ try:
+ # under ZeRO stage 3 the partitioned parameter exposes its full shape via ds_shape
+ rows, columns = weight.ds_shape
+ except AttributeError:
+ rows, columns = weight.shape
+ self.lora_right_weight = nn.Parameter(torch.zeros(
+ columns,
+ lora_dim)) # stored transposed so the forward pass avoids an extra transpose
+ self.lora_left_weight = nn.Parameter(torch.zeros(lora_dim, rows))
+ self.lora_scaling = lora_scaling / lora_dim
+
+ if lora_droppout > 0:
+ self.lora_dropout = nn.Dropout(lora_droppout)
+ else:
+ self.lora_dropout = nn.Identity()
+
+ self.reset_parameters()
+ # disable the original weight gradient
+ self.weight.requires_grad = False
+ # fuse LoRA to the original weight
+ self.fuse_lora = False
+
+ def eval(self):
+ self.lora_dropout.eval()
+
+ def train(self, mode=True):
+ self.lora_dropout.train(mode)
+
+ def reset_parameters(self):
+ nn.init.kaiming_uniform_(self.lora_right_weight, a=math.sqrt(5))
+ nn.init.zeros_(self.lora_left_weight)
+
+ def fuse_lora_weight(self):
+ if not self.fuse_lora:
+ self.weight.data += self.lora_scaling * torch.matmul(
+ self.lora_left_weight.t(), self.lora_right_weight.t())
+ self.fuse_lora = True
+
+ def unfuse_lora_weight(self):
+ if self.fuse_lora:
+ self.weight.data -= self.lora_scaling * torch.matmul(
+ self.lora_left_weight.t(), self.lora_right_weight.t())
+ self.fuse_lora = False
+
+ def forward(self, input):
+ if self.fuse_lora:
+ return F.linear(input, self.weight, self.bias)
+ else:
+ return F.linear(
+ input, self.weight,
+ self.bias) + (self.lora_dropout(input) @ self.lora_right_weight
+ @ self.lora_left_weight) * self.lora_scaling
+
+
+# convert the linear layer to LoRA
+def convert_linear_layer_to_lora(model,
+ part_module_name,
+ lora_dim=0,
+ lora_scaling=1,
+ lora_droppout=0):
+ replace_name = []
+ for name, module in model.named_modules():
+ if isinstance(module, nn.Linear) and part_module_name in name:
+ replace_name.append(name)
+ for name in replace_name:
+ module = recursive_getattr(model, name)
+ tmp = LinearLayer_LoRA(
+ module.weight, lora_dim, lora_scaling, lora_droppout,
+ module.bias).to(module.weight.device).to(module.weight.dtype)
+ recursive_setattr(model, name, tmp)
+ return model
+
+
+def _z3_params_to_fetch(param_list):
+ return [
+ p for p in param_list
+ if hasattr(p, 'ds_id') and p.ds_status == deepspeed.runtime.zero.
+ partition_parameters.ZeroParamStatus.NOT_AVAILABLE
+ ]
+
+
+# convert the LoRA layer to linear layer
+def convert_lora_to_linear_layer(model, fuse_lora=True):
+ replace_name = []
+ for name, module in model.named_modules():
+ if isinstance(module, LinearLayer_LoRA):
+ replace_name.append(name)
+ for name in replace_name:
+ module = recursive_getattr(model, name)
+ zero_stage_3 = hasattr(module.weight, 'ds_id')
+ with deepspeed.zero.GatheredParameters(_z3_params_to_fetch([
+ module.weight, module.bias, module.lora_left_weight,
+ module.lora_right_weight
+ ]),
+ modifier_rank=0,
+ enabled=zero_stage_3):
+ if fuse_lora:
+ module.fuse_lora_weight()
+ else:
+ module.unfuse_lora_weight()
+ return model
+
+def fuse_lora(model):
+ return convert_lora_to_linear_layer(model, fuse_lora=True)
+
+def unfuse_lora(model):
+ return convert_lora_to_linear_layer(model, fuse_lora=False)
+
+
+def only_optimize_lora_parameters(model):
+ # turn off the gradient of all the parameters except the LoRA parameters
+ for name, param in model.named_parameters():
+ if "lora_right_weight" in name or "lora_left_weight" in name:
+ param.requires_grad = True
+ else:
+ param.requires_grad = False
+ return model
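+
+
+# Hedged usage sketch (illustrative; `model` and the hyper-parameters are
+# hypothetical): inject LoRA into every nn.Linear whose name contains "decoder",
+# train only the LoRA weights, then fuse them back into the frozen base weights.
+def _demo_lora_workflow(model):
+ model = convert_linear_layer_to_lora(model, "decoder", lora_dim=8, lora_scaling=1)
+ model = only_optimize_lora_parameters(model)
+ # ... the training loop would run here ...
+ model = fuse_lora(model) # folds the low-rank update into the base weights
+ return model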
diff --git a/applications/DeepSpeed-VisualChat/utils/utils.py b/applications/DeepSpeed-VisualChat/utils/utils.py
new file mode 100644
index 000000000..1853b8a23
--- /dev/null
+++ b/applications/DeepSpeed-VisualChat/utils/utils.py
@@ -0,0 +1,206 @@
+# Copyright (c) Microsoft Corporation.
+# SPDX-License-Identifier: Apache-2.0
+
+# DeepSpeed Team
+import os
+import torch
+import random
+import numpy as np
+from transformers import set_seed, AutoTokenizer
+import json
+import deepspeed
+from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus
+
+
+def print_rank_0(msg, rank=None):
+ if rank is not None and rank <= 0:
+ print(msg)
+ elif is_rank_0():
+ print(msg)
+
+def is_rank_0():
+ """Check whether it is rank 0."""
+ if torch.distributed.is_initialized():
+ if torch.distributed.get_rank() == 0:
+ return True
+ else:
+ return False
+ else:
+ return True
+
+def get_rank():
+ """Check whether it is rank 0."""
+ if torch.distributed.is_initialized():
+ return torch.distributed.get_rank()
+ else:
+ return 0
+
+def to_device(batch, device):
+ output = {}
+ for k, v in batch.items():
+ try:
+ output[k] = v.to(device)
+ except AttributeError:
+ # non-tensor values are passed through unchanged
+ output[k] = v
+ return output
+
+
+class MovingAverage:
+
+ def __init__(self):
+ self.count = 0
+ self.total = 0
+ self.mean = 0
+
+ def update(self, num):
+ self.total += num
+ self.count += 1
+ self.mean = self.total / self.count
+
+ return self.mean
+
+
+def set_random_seed(seed):
+ if seed is not None:
+ set_seed(seed)
+ random.seed(seed)
+ np.random.seed(seed)
+ torch.manual_seed(seed)
+ torch.cuda.manual_seed_all(seed)
+
+
+def get_all_reduce_mean(tensor):
+ torch.distributed.all_reduce(tensor, op=torch.distributed.ReduceOp.SUM)
+ tensor = tensor / torch.distributed.get_world_size()
+ return tensor
+
+
+def get_optimizer_grouped_parameters(model,
+ weight_decay,
+ no_decay_name_list=["bias", "LayerNorm.weight"],
+ small_learning_rate_list=["embed"],
+ small_lr=1e-4):
+
+ optimizer_grouped_parameters = [
+ {
+ "params": [
+ p for n, p in model.named_parameters()
+ if (not any(nd in n
+ for nd in no_decay_name_list) and (not any(nd in n
+ for nd in small_learning_rate_list)) and p.requires_grad)
+ ],
+ "weight_decay":
+ weight_decay,
+ },
+ {
+ "params": [
+ p for n, p in model.named_parameters()
+ if (any(nd in n
+ for nd in no_decay_name_list) and (not any(nd in n
+ for nd in small_learning_rate_list)) and p.requires_grad)
+ ],
+ "weight_decay":
+ 0.0,
+ },
+ {
+ "params": [
+ p for n, p in model.named_parameters()
+ if (not any(nd in n
+ for nd in no_decay_name_list) and (any(nd in n
+ for nd in small_learning_rate_list)) and p.requires_grad)
+ ],
+ "weight_decay":
+ weight_decay,
+ "lr": small_lr
+ },
+ {
+ "params": [
+ p for n, p in model.named_parameters()
+ if (any(nd in n
+ for nd in no_decay_name_list) and (any(nd in n
+ for nd in small_learning_rate_list)) and p.requires_grad)
+ ],
+ "weight_decay":
+ 0.0,
+ "lr": small_lr
+ },
+ ]
+ return optimizer_grouped_parameters
+
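+# Hedged usage sketch (illustrative; `model` and the values are hypothetical):
+# the four groups apply weight decay only to non-bias / non-LayerNorm weights
+# and a smaller learning rate to parameters whose names contain "embed".
+def _demo_build_optimizer(model):
+ groups = get_optimizer_grouped_parameters(model, weight_decay=0.1, small_lr=1e-4)
+ return torch.optim.AdamW(groups, lr=1e-3, betas=(0.9, 0.95))
+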
+
+def _z3_params_to_fetch(param_list):
+ return [
+ p for p in param_list
+ if hasattr(p, 'ds_id') and p.ds_status == ZeroParamStatus.NOT_AVAILABLE
+ ]
+
+
+def moving_average(model, model_ema, beta=0.992, device=None, zero_stage=0):
+ zero_stage_3 = (zero_stage == 3)
+ with torch.no_grad():
+ for param, param_ema in zip(model.parameters(),
+ model_ema.parameters()):
+ # TODO: use prefiltering for efficiency
+ params_to_fetch = _z3_params_to_fetch([param, param_ema
+ ]) if zero_stage_3 else []
+ should_gather_param = len(params_to_fetch) > 0
+ with deepspeed.zero.GatheredParameters(
+ params_to_fetch, enabled=should_gather_param):
+ data = param.data
+ if device is not None:
+ data = data.to(device)
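+ # EMA update: param_ema <- (1 - beta) * param + beta * param_ema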
+ param_ema.data.copy_(torch.lerp(data, param_ema.data, beta))
+
+def save_hf_format(model, tokenizer, args, sub_folder=""):
+ # save the model in Hugging Face format so it can later be loaded with from_pretrained
+ model_to_save = model.module if hasattr(model, 'module') else model
+ CONFIG_NAME = "config.json"
+ WEIGHTS_NAME = "pytorch_model.bin"
+ output_dir = os.path.join(args.output_dir, sub_folder)
+ os.makedirs(output_dir, exist_ok=True)
+ output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
+ output_config_file = os.path.join(output_dir, CONFIG_NAME)
+ save_dict = model_to_save.state_dict()
+ # for key in list(save_dict.keys()):
+ # if "lora" in key:
+ # del save_dict[key]
+ torch.save(save_dict, output_model_file)
+ try:
+ model_to_save.config.to_json_file(output_config_file)
+ except AttributeError:
+ # the model has no standard HF config; save the training args instead
+ args_dict = vars(args)
+ torch.save(args_dict, os.path.join(output_dir, 'train_args.pt'))
+ print("config could not be saved; train_args.pt was written instead")
+ # tokenizer.save_vocabulary(output_dir)
+ tokenizer.save_pretrained(output_dir) # this will save all tokenizer files
+
+def save_zero_three_model(model_ema, global_rank, save_dir, zero_stage=0, sub_folder=""):
+ zero_stage_3 = (zero_stage == 3)
+ output_dir = os.path.join(save_dir, sub_folder)
+ os.makedirs(output_dir, exist_ok=True)
+ WEIGHTS_NAME = "pytorch_model.bin"
+ output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
+
+ model_to_save = model_ema.module if hasattr(model_ema,
+ 'module') else model_ema
+ if not zero_stage_3:
+ if global_rank == 0:
+ torch.save(model_to_save.state_dict(), output_model_file)
+ else:
+ output_state_dict = {}
+ for k, v in model_to_save.named_parameters():
+ if hasattr(v, 'ds_id'):
+ with deepspeed.zero.GatheredParameters(_z3_params_to_fetch([v
+ ]),
+ enabled=zero_stage_3):
+ v_p = v.data.clone().detach().cpu() # this is a hack to get around the fact that we can't get the data from the param
+ else:
+ v_p = v.cpu()
+ if global_rank == 0 and "lora" not in k:
+ output_state_dict[k] = v_p
+ if global_rank == 0:
+ torch.save(output_state_dict, output_model_file)
+ del output_state_dict