Added mixtral to the benchmark list, improved benchmark docs
arjunsuresh committed Jul 3, 2024
1 parent 165f5f0 commit 0718769
Showing 3 changed files with 28 additions and 23 deletions.
25 changes: 12 additions & 13 deletions docs/benchmarks/index.md
@@ -2,27 +2,26 @@

Please visit the individual benchmark links to see the run commands using the unified CM interface.

1. [Image Classification](image_classification/resnet50.md) using ResNet50 model and Imagenet-2012 dataset
1. [Image Classification](image_classification/resnet50.md) using ResNet50-v1.5 model and Imagenet-2012 (224x224) validation dataset. Dataset size is 50,000 and QSL size is 1024. Reference model accuracy is 76.46%. Server scenario latency constraint is 15ms.

2. [Text to Image](text_to_image/sdxl.md) using Stable Diffusion model and Coco2014 dataset
2. [Text to Image](text_to_image/sdxl.md) using Stable Diffusion model and a subset of the Coco2014 dataset. Dataset size is 5000 and QSL size is the same. Required accuracy for closed division is (23.01085758 <= FID <= 23.95007626, 31.68631873 <= CLIP <= 31.81331801).

3. [Object Detection](object_detection/retinanet.md) using Retinanet model and OpenImages dataset
3. [Object Detection](object_detection/retinanet.md) using Retinanet model and OpenImages dataset. Dataset size is 24781 and QSL size is 64. Reference model accuracy is 0.3755 mAP. Server scenario latency constraint is 100ms.

4. [Image Segmentation](medical_imaging/3d-unet.md) using 3d-unet model and KiTS19 dataset
4. [Medical Image Segmentation](medical_imaging/3d-unet.md) using 3d-unet model and KiTS2019 dataset. Dataset size is 42 and QSL size is the same. Reference model accuracy is 0.86330 mean DICE score. Server scenario is not applicable.

5. [Question Answering](language/bert.md) using Bert-Large model and Squad v1.1 dataset
5. [Question Answering](language/bert.md) using Bert-Large model and Squad v1.1 dataset with a maximum sequence length of 384. Dataset size is 10833 and QSL size is the same. Reference model accuracy is F1 score = 90.874%. Server scenario latency constraint is 130ms.

6. [Text Summarization](language/gpt-j.md) using GPT-J model and CNN Daily Mail dataset
6. [Text Summarization](language/gpt-j.md) using GPT-J model and CNN Daily Mail v3.0.0 dataset. Dataset size is 13368 and QSL size is the same. Reference model accuracy is (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881, gen_len=4016878). Server scenario latency constraint is 20s.

7. [Text Summarization](language/llama2-70b.md) using LLAMA2-70b model and OpenORCA dataset
7. [Question Answering](language/llama2-70b.md) using LLAMA2-70b model and OpenORCA (GPT-4 split, max_seq_len=1024) dataset. Dataset size is 24576 and QSL size is the same. Reference model accuracy is (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162, tokens_per_sample=294.45). Server scenario latency constraint is TTFT=2000ms, TPOT=200ms.

8. [Recommendation](recommendation/dlrm-v2.md) using DLRMv2 model and Criteo multihot dataset
8. [Question Answering, Math and Code Generation](language/mixtral-8x7b.md) using Mixtral-8x7B model and OpenORCA (5k samples of GPT-4 split, max_seq_len=2048), GSM8K (5k samples of the validation split, max_seq_len=2048), MBXP (5k samples of the validation split, max_seq_len=2048) datasets. Dataset size is 15000 and QSL size is the same. Reference model accuracy is (rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, gsm8k accuracy = 73.78, mbxp accuracy = 60.12, tokens_per_sample=294.45). Server scenario latency constraint is TTFT=2000ms, TPOT=200ms.

All the eight benchmarks can participate in the datacenter category.
All the eight benchmarks except DLRMv2 and LLAMA2 and can participate in the edge category.
9. [Recommendation](recommendation/dlrm-v2.md) using DLRMv2 model and Synthetic Multihot Criteo dataset. Dataset size is 204800 and QSL size is the same. Reference model accuracy is AUC=80.31%. Server scenario latency constraint is 60 ms.

All nine benchmarks can participate in the datacenter category.
All benchmarks except DLRMv2, LLAMA2-70B and Mixtral-8x7B can also participate in the edge category.

`bert`, `llama2-70b`, `dlrm_v2` and `3d-unet` have a high accuracy (99.9%) variant, where the benchmark run must achieve an accuracy of at least `99.9%` of the FP32 reference model,
in comparison with the default `99%` accuracy requirement.

The `dlrm_v2` benchmark has a high-accuracy variant only. If this accuracy is not met, the submission result can be submitted only to the open division.
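
As a rough illustration of the unified CM interface referenced above, a performance-estimation test run for the ResNet50 benchmark could look like the sketch below. Only the `--model`, `--implementation` and `--framework` flags appear verbatim in this commit's command templates; the remaining flags mirror the generator's parameters, and the framework, device and query-count values are placeholders, so follow the individual benchmark pages for the exact invocation.

```bash
# Hedged sketch, not a verified invocation: a short Offline performance
# estimation for ResNet50 using the reference implementation on CPU.
# Flags beyond --model, --implementation and --framework are assumed from the
# generator's parameters; framework, device and query count are placeholders.
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 \
   --model=resnet50 \
   --implementation=reference \
   --framework=onnxruntime \
   --category=edge \
   --scenario=Offline \
   --device=cpu \
   --execution_mode=test \
   --test_query_count=100 \
   --quiet
```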

24 changes: 15 additions & 9 deletions main.py
@@ -12,6 +12,10 @@ def mlperf_inference_implementation_readme(spaces, model, implementation):
content=""
scenarios = []
execution_envs = ["Docker","Native"]
code_version="r4.1"

if model == "rnnt":
code_version="r4.0"

if implementation == "reference":
devices = [ "CPU", "CUDA", "ROCm" ]
@@ -31,8 +35,10 @@ def mlperf_inference_implementation_readme(spaces, model, implementation):
frameworks = [ "TensorRT" ]

elif implementation == "intel":
if model not in [ "bert-99", "bert-99.9", "gptj-99", "gptj-99.9" ]:
if model not in [ "bert-99", "bert-99.9", "gptj-99", "gptj-99.9", "resnet50", "retinanet", "3d-unet-99", "3d-unet-99.9" ]:
return pre_space+" WIP"
if model in [ "bert-99", "bert-99.9", "retinanet", "3d-unet-99", "3d-unet-99.9" ]:
code_version="r4.0"
devices = [ "CPU" ]
frameworks = [ "Pytorch" ]

@@ -109,14 +115,14 @@ def mlperf_inference_implementation_readme(spaces, model, implementation):
content += f"{cur_space3}####### Setup a virtual environment for Python\n"
content += get_venv_command(spaces+16)
content += f"{cur_space3}####### Performance Estimation for Offline Scenario\n"
content += mlperf_inference_run_command(spaces+17, model, implementation, framework.lower(), category.lower(), "Offline", device.lower(), "test", test_query_count, True).replace("--docker ","")
content += mlperf_inference_run_command(spaces+17, model, implementation, framework.lower(), category.lower(), "Offline", device.lower(), "test", test_query_count, True, scenarios, code_version).replace("--docker ","")
content += f"{cur_space3}The above command should do a test run of Offline scenario and record the estimated offline_target_qps.\n\n"

else: # Docker implementation steps
content += f"{cur_space3}####### Docker Container Build and Performance Estimation for Offline Scenario\n"
docker_info = get_docker_info(spaces+16, model, implementation, device)
content += docker_info
content += mlperf_inference_run_command(spaces+17, model, implementation, framework.lower(), category.lower(), "Offline", device.lower(), "test", test_query_count, True)
content += mlperf_inference_run_command(spaces+17, model, implementation, framework.lower(), category.lower(), "Offline", device.lower(), "test", test_query_count, True, scenarios, code_version)
content += f"{cur_space3}The above command should get you to an interactive shell inside the docker container and do a quick test run for the Offline scenario. Once inside the docker container please do the below commands to do the accuracy + performance runs for each scenario.\n\n"
content += f"{cur_space3}<details>\n"
content += f"{cur_space3}<summary> Please click here to see more options for the docker launch </summary>\n\n"
@@ -131,7 +137,7 @@ def mlperf_inference_implementation_readme(spaces, model, implementation):
else:
content += f"{cur_space3} You can reuse the same environment as described for {model.split('.')[0]}.\n"
content += f"{cur_space3}###### Performance Estimation for Offline Scenario\n"
content += mlperf_inference_run_command(spaces+17, model, implementation, framework.lower(), category.lower(), "Offline", device.lower(), "test", test_query_count, True).replace("--docker ","")
content += mlperf_inference_run_command(spaces+17, model, implementation, framework.lower(), category.lower(), "Offline", device.lower(), "test", test_query_count, True, scenarios, code_version).replace("--docker ","")
content += f"{cur_space3}The above command should do a test run of Offline scenario and record the estimated offline_target_qps.\n\n"


@@ -144,12 +150,12 @@ def mlperf_inference_implementation_readme(spaces, model, implementation):

for scenario in scenarios:
content += f"{cur_space3}=== \"{scenario}\"\n{cur_space4}###### {scenario}\n\n"
run_cmd = mlperf_inference_run_command(spaces+21, model, implementation, framework.lower(), category.lower(), scenario, device.lower(), "valid", scenarios)
run_cmd = mlperf_inference_run_command(spaces+21, model, implementation, framework.lower(), category.lower(), scenario, device.lower(), "valid", 0, False, scenarios, code_version)
content += run_cmd
#content += run_suffix

content += f"{cur_space3}=== \"All Scenarios\"\n{cur_space4}###### All Scenarios\n\n"
run_cmd = mlperf_inference_run_command(spaces+21, model, implementation, framework.lower(), category.lower(), "All Scenarios", device.lower(), "valid", scenarios)
run_cmd = mlperf_inference_run_command(spaces+21, model, implementation, framework.lower(), category.lower(), "All Scenarios", device.lower(), "valid", 0, False, scenarios, code_version)
content += run_cmd
content += run_suffix

@@ -235,7 +241,7 @@ def get_run_cmd_extra(f_pre_space, model, implementation, device, scenario, scen
return extra_content

@env.macro
def mlperf_inference_run_command(spaces, model, implementation, framework, category, scenario, device="cpu", execution_mode="test", test_query_count="20", docker=False, scenarios = []):
def mlperf_inference_run_command(spaces, model, implementation, framework, category, scenario, device="cpu", execution_mode="test", test_query_count="20", docker=False, scenarios = [], code_version="r4.1"):
pre_space = ""
for i in range(1,spaces):
pre_space = pre_space + " "
@@ -260,7 +266,7 @@ def mlperf_inference_run_command(spaces, model, implementation, framework, categ

docker_setup_cmd = f"""\n
{f_pre_space}```bash
{f_pre_space}cm run script --tags=run-mlperf,inference,_find-performance,_full{scenario_variation_tag} \\
{f_pre_space}cm run script --tags=run-mlperf,inference,_find-performance,_full,_{code_version}{scenario_variation_tag} \\
{pre_space} --model={model} \\
{pre_space} --implementation={implementation} \\
{pre_space} --framework={framework} \\
Expand All @@ -279,7 +285,7 @@ def mlperf_inference_run_command(spaces, model, implementation, framework, categ

run_cmd = f"""\n
{f_pre_space}```bash
{f_pre_space}cm run script --tags=run-mlperf,inference{scenario_variation_tag} \\
{f_pre_space}cm run script --tags=run-mlperf,inference,_{code_version}{scenario_variation_tag} \\
{pre_space} --model={model} \\
{pre_space} --implementation={implementation} \\
{pre_space} --framework={framework} \\
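The net effect of threading `code_version` through `mlperf_inference_run_command` is confined to the `--tags` string of the generated commands. A hedged before/after sketch, with every flag other than the tags treated as a placeholder:

```bash
# Hedged sketch of the generated --tags change; flag values are placeholders.

# Previously generated (no release tag):
cm run script --tags=run-mlperf,inference,_find-performance,_full \
   --model=bert-99 --implementation=reference --framework=pytorch --quiet

# Now generated: _r4.1 by default, _r4.0 for rnnt and for the Intel
# bert / retinanet / 3d-unet targets pinned above.
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 \
   --model=bert-99 --implementation=reference --framework=pytorch --quiet
```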
2 changes: 1 addition & 1 deletion mkdocs.yml
@@ -35,7 +35,7 @@ nav:
- Bert-Large: benchmarks/language/bert.md
- GPT-J: benchmarks/language/gpt-j.md
- LLAMA2-70B: benchmarks/language/llama2-70b.md
- MIXTRAL-8x7b: benchmarks/language/mixtral-8x7b.md
- MIXTRAL-8x7B: benchmarks/language/mixtral-8x7b.md
- Recommendation:
- DLRM-v2: benchmarks/recommendation/dlrm-v2.md
- Submission:
