Added mixtral to the benchmark list, improved benchmark docs
arjunsuresh committed Jul 3, 2024
1 parent 165f5f0 commit 0718769
Showing 3 changed files with 28 additions and 23 deletions.
25 changes: 12 additions & 13 deletions docs/benchmarks/index.md
@@ -2,27 +2,26 @@

Please visit the individual benchmark links to see the run commands using the unified CM interface.

1. [Image Classification](image_classification/resnet50.md) using ResNet50 model and Imagenet-2012 dataset
1. [Image Classification](image_classification/resnet50.md) using ResNet50-v1.5 model and Imagenet-2012 (224x224) validation dataset. Dataset size is 50,000 and QSL size is 1024. Reference model accuracy is 76.46%. Server scenario latency constraint is 15ms.

2. [Text to Image](text_to_image/sdxl.md) using Stable Diffusion model and Coco2014 dataset
2. [Text to Image](text_to_image/sdxl.md) using Stable Diffusion model and a subset of the Coco2014 dataset. Dataset size is 5000 and QSL size is the same. Required accuracy for closed division is (23.01085758 <= FID <= 23.95007626, 31.68631873 <= CLIP <= 31.81331801).

3. [Object Detection](object_detection/retinanet.md) using Retinanet model and OpenImages dataset
3. [Object Detection](object_detection/retinanet.md) using Retinanet model and OpenImages dataset. Dataset size is 24781 and QSL size is 64. Reference model accuracy is 0.3755 mAP. Server scenario latency constraint is 100ms.

4. [Image Segmentation](medical_imaging/3d-unet.md) using 3d-unet model and KiTS19 dataset
4. [Medical Image Segmentation](medical_imaging/3d-unet.md) using 3d-unet model and KiTS2019 dataset. Dataset size is 42 and QSL size is the same. Reference model accuracy is 0.86330 mean DICE score. Server scenario is not applicable.

5. [Question Answering](language/bert.md) using Bert-Large model and Squad v1.1 dataset
5. [Question Answering](language/bert.md) using Bert-Large model and Squad v1.1 dataset with a maximum sequence length of 384. Dataset size is 10833 and QSL size is the same. Reference model accuracy is F1 score = 90.874%. Server scenario latency constraint is 130ms.

6. [Text Summarization](language/gpt-j.md) using GPT-J model and CNN Daily Mail dataset
6. [Text Summarization](language/gpt-j.md) using GPT-J model and CNN Daily Mail v3.0.0 dataset. Dataset size is 13368 and QSL size is the same. Reference model accuracy is (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881, gen_len=4016878). Server scenario latency constraint is 20s.

7. [Text Summarization](language/llama2-70b.md) using LLAMA2-70b model and OpenORCA dataset
7. [Question Answering](language/llama2-70b.md) using LLAMA2-70b model and OpenORCA (GPT-4 split, max_seq_len=1024) dataset. Dataset size is 24576 and QSL size is the same. Reference model accuracy is (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162, tokens_per_sample=294.45). Server scenario latency constraint is TTFT=2000ms, TPOT=200ms.

8. [Recommendation](recommendation/dlrm-v2.md) using DLRMv2 model and Criteo multihot dataset
8. [Question Answering, Math and Code Generation](language/mixtral-8x7b.md) using Mixtral-8x7B model and OpenORCA (5k samples of GPT-4 split, max_seq_len=2048), GSM8K (5k samples of the validation split, max_seq_len=2048), MBXP (5k samples of the validation split, max_seq_len=2048) datasets. Dataset size is 15000 and QSL size is the same. Reference model accuracy is (rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, gsm8k accuracy = 73.78, mbxp accuracy = 60.12, tokens_per_sample=294.45). Server scenario latency constraint is TTFT=2000ms, TPOT=200ms.

All the eight benchmarks can participate in the datacenter category.
All the eight benchmarks except DLRMv2 and LLAMA2 and can participate in the edge category.
9. [Recommendation](recommendation/dlrm-v2.md) using DLRMv2 model and Synthetic Multihot Criteo dataset. Dataset size is 204800 and QSL size is the same. Reference model accuracy is AUC=80.31%. Server scenario latency constraint is 60 ms.

All nine benchmarks can participate in the datacenter category.
All benchmarks except DLRMv2, LLAMA2-70B and Mixtral-8x7B can also participate in the edge category.

`bert`, `llama2-70b`, `dlrm_v2` and `3d-unet` have a high accuracy (99.9%) variant, where the benchmark run must achieve an accuracy of at least `99.9%` of the FP32 reference model,
in comparison with the default `99%` accuracy requirement.

The `dlrm_v2` benchmark has a high-accuracy variant only. If this accuracy is not met, the submission result can be submitted only to the open division.
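
As a rough illustration of the unified CM interface referenced above, a performance-estimation test run for the ResNet50 benchmark could look like the sketch below. Only the `--model`, `--implementation` and `--framework` flags appear verbatim in this commit's command templates; the remaining flags mirror the generator's parameters, and the framework, device and query-count values are placeholders, so follow the individual benchmark pages for the exact invocation.

```bash
# Hedged sketch, not a verified invocation: a short Offline performance
# estimation for ResNet50 using the reference implementation on CPU.
# Flags beyond --model, --implementation and --framework are assumed from the
# generator's parameters; framework, device and query count are placeholders.
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 \
   --model=resnet50 \
   --implementation=reference \
   --framework=onnxruntime \
   --category=edge \
   --scenario=Offline \
   --device=cpu \
   --execution_mode=test \
   --test_query_count=100 \
   --quiet
```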

24 changes: 15 additions & 9 deletions main.py
@@ -12,6 +12,10 @@ def mlperf_inference_implementation_readme(spaces, model, implementation):
content=""
scenarios = []
execution_envs = ["Docker","Native"]
code_version="r4.1"

if model == "rnnt":
code_version="r4.0"

if implementation == "reference":
devices = [ "CPU", "CUDA", "ROCm" ]
@@ -31,8 +35,10 @@ def mlperf_inference_implementation_readme(spaces, model, implementation):
frameworks = [ "TensorRT" ]

elif implementation == "intel":
if model not in [ "bert-99", "bert-99.9", "gptj-99", "gptj-99.9" ]:
if model not in [ "bert-99", "bert-99.9", "gptj-99", "gptj-99.9", "resnet50", "retinanet", "3d-unet-99", "3d-unet-99.9" ]:
return pre_space+" WIP"
if model in [ "bert-99", "bert-99.9", "retinanet", "3d-unet-99", "3d-unet-99.9" ]:
code_version="r4.0"
devices = [ "CPU" ]
frameworks = [ "Pytorch" ]

@@ -109,14 +115,14 @@ def mlperf_inference_implementation_readme(spaces, model, implementation):
content += f"{cur_space3}####### Setup a virtual environment for Python\n"
content += get_venv_command(spaces+16)
content += f"{cur_space3}####### Performance Estimation for Offline Scenario\n"
content += mlperf_inference_run_command(spaces+17, model, implementation, framework.lower(), category.lower(), "Offline", device.lower(), "test", test_query_count, True).replace("--docker ","")
content += mlperf_inference_run_command(spaces+17, model, implementation, framework.lower(), category.lower(), "Offline", device.lower(), "test", test_query_count, True, scenarios, code_version).replace("--docker ","")
content += f"{cur_space3}The above command should do a test run of Offline scenario and record the estimated offline_target_qps.\n\n"

else: # Docker implementation steps
content += f"{cur_space3}####### Docker Container Build and Performance Estimation for Offline Scenario\n"
docker_info = get_docker_info(spaces+16, model, implementation, device)
content += docker_info
content += mlperf_inference_run_command(spaces+17, model, implementation, framework.lower(), category.lower(), "Offline", device.lower(), "test", test_query_count, True)
content += mlperf_inference_run_command(spaces+17, model, implementation, framework.lower(), category.lower(), "Offline", device.lower(), "test", test_query_count, True, scenarios, code_version)
content += f"{cur_space3}The above command should get you to an interactive shell inside the docker container and do a quick test run for the Offline scenario. Once inside the docker container please do the below commands to do the accuracy + performance runs for each scenario.\n\n"
content += f"{cur_space3}<details>\n"
content += f"{cur_space3}<summary> Please click here to see more options for the docker launch </summary>\n\n"
@@ -131,7 +137,7 @@ def mlperf_inference_implementation_readme(spaces, model, implementation):
else:
content += f"{cur_space3} You can reuse the same environment as described for {model.split('.')[0]}.\n"
content += f"{cur_space3}###### Performance Estimation for Offline Scenario\n"
content += mlperf_inference_run_command(spaces+17, model, implementation, framework.lower(), category.lower(), "Offline", device.lower(), "test", test_query_count, True).replace("--docker ","")
content += mlperf_inference_run_command(spaces+17, model, implementation, framework.lower(), category.lower(), "Offline", device.lower(), "test", test_query_count, True, scenarios, code_version).replace("--docker ","")
content += f"{cur_space3}The above command should do a test run of Offline scenario and record the estimated offline_target_qps.\n\n"


@@ -144,12 +150,12 @@ def mlperf_inference_implementation_readme(spaces, model, implementation):

for scenario in scenarios:
content += f"{cur_space3}=== \"{scenario}\"\n{cur_space4}###### {scenario}\n\n"
run_cmd = mlperf_inference_run_command(spaces+21, model, implementation, framework.lower(), category.lower(), scenario, device.lower(), "valid", scenarios)
run_cmd = mlperf_inference_run_command(spaces+21, model, implementation, framework.lower(), category.lower(), scenario, device.lower(), "valid", 0, False, scenarios, code_version)
content += run_cmd
#content += run_suffix

content += f"{cur_space3}=== \"All Scenarios\"\n{cur_space4}###### All Scenarios\n\n"
run_cmd = mlperf_inference_run_command(spaces+21, model, implementation, framework.lower(), category.lower(), "All Scenarios", device.lower(), "valid", scenarios)
run_cmd = mlperf_inference_run_command(spaces+21, model, implementation, framework.lower(), category.lower(), "All Scenarios", device.lower(), "valid", 0, False, scenarios, code_version)
content += run_cmd
content += run_suffix

@@ -235,7 +241,7 @@ def get_run_cmd_extra(f_pre_space, model, implementation, device, scenario, scen
return extra_content

@env.macro
def mlperf_inference_run_command(spaces, model, implementation, framework, category, scenario, device="cpu", execution_mode="test", test_query_count="20", docker=False, scenarios = []):
def mlperf_inference_run_command(spaces, model, implementation, framework, category, scenario, device="cpu", execution_mode="test", test_query_count="20", docker=False, scenarios = [], code_version="r4.1"):
pre_space = ""
for i in range(1,spaces):
pre_space = pre_space + " "
@@ -260,7 +266,7 @@ def mlperf_inference_run_command(spaces, model, implementation, framework, categ

docker_setup_cmd = f"""\n
{f_pre_space}```bash
{f_pre_space}cm run script --tags=run-mlperf,inference,_find-performance,_full{scenario_variation_tag} \\
{f_pre_space}cm run script --tags=run-mlperf,inference,_find-performance,_full,_{code_version}{scenario_variation_tag} \\
{pre_space} --model={model} \\
{pre_space} --implementation={implementation} \\
{pre_space} --framework={framework} \\
Expand All @@ -279,7 +285,7 @@ def mlperf_inference_run_command(spaces, model, implementation, framework, categ

run_cmd = f"""\n
{f_pre_space}```bash
{f_pre_space}cm run script --tags=run-mlperf,inference{scenario_variation_tag} \\
{f_pre_space}cm run script --tags=run-mlperf,inference,_{code_version}{scenario_variation_tag} \\
{pre_space} --model={model} \\
{pre_space} --implementation={implementation} \\
{pre_space} --framework={framework} \\
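The net effect of threading `code_version` through `mlperf_inference_run_command` is confined to the `--tags` string of the generated commands. A hedged before/after sketch, with every flag other than the tags treated as a placeholder:

```bash
# Hedged sketch of the generated --tags change; flag values are placeholders.

# Previously generated (no release tag):
cm run script --tags=run-mlperf,inference,_find-performance,_full \
   --model=bert-99 --implementation=reference --framework=pytorch --quiet

# Now generated: _r4.1 by default, _r4.0 for rnnt and for the Intel
# bert / retinanet / 3d-unet targets pinned above.
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1 \
   --model=bert-99 --implementation=reference --framework=pytorch --quiet
```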
2 changes: 1 addition & 1 deletion mkdocs.yml
@@ -35,7 +35,7 @@ nav:
- Bert-Large: benchmarks/language/bert.md
- GPT-J: benchmarks/language/gpt-j.md
- LLAMA2-70B: benchmarks/language/llama2-70b.md
- MIXTRAL-8x7b: benchmarks/language/mixtral-8x7b.md
- MIXTRAL-8x7B: benchmarks/language/mixtral-8x7b.md
- Recommendation:
- DLRM-v2: benchmarks/recommendation/dlrm-v2.md
- Submission:
