Releases: microsoft/superbenchmark

Release SuperBench v0.11.0

08 Oct 06:18
75dac87

SuperBench 0.11.0 Release Notes

SuperBench Improvements

  • Add CUDA 12.4 dockerfile.
  • Upgrade NCCL to v2.23.4 and install UCX v1.16.0 in the CUDA 12.4 dockerfile.
  • Fix MSCCL build error in CUDA12.4 docker build pipeline.
  • Add ROCm6.2 dockerfile.
  • Update hpcx link in cuda11.1 dockerfile to fix docker build failure.
  • Improve documentation (fix metric names and typos in the user tutorial, add BibTeX to README and repo).
  • Limit protobuf version to 3.20.x to fix onnxruntime dependency error.
  • Update omegaconf version to 2.3.0 and fix issues caused by omegaconf version update.
  • Update Docker Exec Command for Persistent HPCX Environment.
  • Fix cuda 12.2 dockerfile LD_LIBRARY_PATH issue.
  • Use types-setuptools to replace types-pkg_resources.
  • Add configuration for NDv5 H100 and AMD MI300x.

Micro-benchmark Improvements

  • Add hipblasLt tuning to dist-inference cpp implementation.
  • Add support for NVIDIA L4/L40/L40s GPUs in gemm-flops.
  • Upgrade mlc to v3.11.

Model-benchmark Improvements

  • Support FP8 transformer model training in ROCm6.2 dockerfile.

Result Analysis

  • Fix a bug in the failure test and a pandas warning in data diagnosis.

Release SuperBench v0.10.0

03 Jan 00:10
9eb2bdf

SuperBench 0.10.0 Release Notes

SuperBench Improvements

  • Support monitoring for AMD GPUs.
  • Support ROCm 5.7 and ROCm 6.0 dockerfile.
  • Add MSCCL support for Nvidia GPU.
  • Fix NUMA domains swap issue in NDv4 topology file.
  • Add NDv5 topo file.
  • Pin NCCL and NCCL-test to 2.18.3 to avoid a hang issue in CUDA 12.2.

Micro-benchmark Improvements

  • Add HPL random generator to gemm-flops with ROCm.
  • Add DirectXGPURenderFPS benchmark to measure the FPS of rendering simple frames.
  • Add HWDecoderFPS benchmark to measure the FPS of hardware decoder performance.
  • Update Docker image for H100 support.
  • Update MLC version to 3.10 for CUDA/ROCm dockerfile.
  • Bug fix for GPU Burn test.
  • Support INT8 in cublaslt function.
  • Add hipBLASLt function benchmark.
  • Support cpu-gpu and gpu-cpu in ib-validation.
  • Support graph mode in NCCL/RCCL benchmarks for latency metrics.
  • Support cpp implementation in distributed inference benchmark.
  • Add O2 option for gpu copy ROCm build.
  • Support different hipblasLt data types in dist inference.
  • Support in-place in NCCL/RCCL benchmark.
  • Support data type option in NCCL/RCCL benchmark.
  • Improve P2P performance with fine-grained GPU memory in GPU-copy test for AMD GPUs.
  • Update hipblaslt GEMM metric unit to tflops.
  • Support FP8 for hipblaslt benchmark.
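
The TFLOPS unit mentioned above follows from the usual GEMM operation count of 2·m·n·k floating-point operations. The sketch below shows that arithmetic only; the function name and the sample numbers are hypothetical and are not taken from the SuperBench source.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Return achieved TFLOPS for one (m x k) @ (k x n) GEMM that took `seconds`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # Each of the m*n output elements needs k multiplies and k adds.
    flops = 2 * m * n * k
    return flops / seconds / 1e12


if __name__ == "__main__":
    # Hypothetical measurement: a 4096 x 4096 x 4096 GEMM timed at 1 ms.
    print(f"{gemm_tflops(4096, 4096, 4096, 1e-3):.2f} TFLOPS")
```

Reporting in TFLOPS rather than raw FLOPS keeps values in a readable range for current accelerators.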

Model Benchmark Improvements

  • Change torch.distributed.launch to torchrun.
  • Support Megatron-LM/Megatron-Deepspeed GPT pretrain benchmark.
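
One practical difference behind the launcher change: torchrun exports the per-process rank in the LOCAL_RANK environment variable, while torch.distributed.launch by default passes it as a --local_rank argument. The following is a minimal sketch of a script that tolerates both; the helper name resolve_local_rank is hypothetical and does not come from SuperBench.

```python
import argparse
import os


def resolve_local_rank(argv=None) -> int:
    """Return the local rank from LOCAL_RANK (torchrun) or --local_rank (legacy)."""
    parser = argparse.ArgumentParser()
    # The legacy launcher passes --local_rank; newer releases spell it --local-rank.
    parser.add_argument("--local_rank", "--local-rank", type=int, default=None)
    args, _unknown = parser.parse_known_args(argv)

    # Prefer the environment variable, which torchrun always sets.
    if "LOCAL_RANK" in os.environ:
        return int(os.environ["LOCAL_RANK"])
    if args.local_rank is not None:
        return args.local_rank
    return 0  # single-process fallback (an assumption of this sketch)


if __name__ == "__main__":
    print("local rank:", resolve_local_rank())
```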

Result Analysis

  • Support baseline generation from multiple nodes.
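
These notes do not describe how the baseline is aggregated across nodes. As an illustration only, the sketch below takes the per-metric median over all nodes, which is one common choice because it resists a single bad node. The function name, node names, and metric names are all hypothetical.

```python
from statistics import median
from typing import Dict, List


def build_baseline(node_results: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Aggregate {node: {metric: value}} into {metric: median across nodes}."""
    per_metric: Dict[str, List[float]] = {}
    for metrics in node_results.values():
        for name, value in metrics.items():
            per_metric.setdefault(name, []).append(value)
    # Median is an assumption of this sketch, not SuperBench's documented rule.
    return {name: median(values) for name, values in per_metric.items()}


if __name__ == "__main__":
    results = {
        "node0": {"example/bandwidth": 100.0, "example/latency": 2.0},
        "node1": {"example/bandwidth": 102.0, "example/latency": 2.2},
        "node2": {"example/bandwidth": 60.0, "example/latency": 2.1},
    }
    print(build_baseline(results))
```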

Release SuperBench v0.9.0

26 Jul 07:13
1537a27

SuperBench 0.9.0 Release Notes

SuperBench Improvements

  • Support Ctrl+C and interrupt to stop all SuperBench testing.
  • Support Windows Docker for VDI/Gaming GPU.
  • Support DirectX platform for Nvidia and AMD GPU.
  • Add System Config Info feature in SB runner to support distributed collection.
  • Support DirectX test pipeline.

Micro-benchmark Improvements

  • Add DirectXGPUCopyBw Benchmark to measure HtoD/DtoH bandwidth by DirectX.
  • Add DirectXGPUCoreFLops Benchmark to measure peak FLOPS by DirectX.
  • Add DirectXGPUMemBw Benchmark to measure GPU memory bandwidth by DirectX.
  • Add DirectXVCNEncodingLatency Benchmark to measure the VCN hardware encoding latency on AMD graphics GPUs.
  • Support best algorithm selection in cudnn-function microbenchmark.
  • Revise step time collection in distributed inference benchmark.

Model Benchmark Improvements

  • Fix early stop logic due to num_steps in model benchmarks.
  • Support TensorRT models on Nvidia H100.

Documentation Improvements

  • Improve documentation for System Config Info.
  • Update outdated references.

Release SuperBench v0.8.0

14 Apr 06:39
694ae2a

SuperBench 0.8.0 Release Notes

SuperBench Improvements

  • Support SuperBench Executor running on Windows.
  • Remove fixed rccl version in rocm5.1.x docker file.
  • Upgrade networkx version to fix installation compatibility issue.
  • Pin setuptools version to v65.7.0.
  • Limit ansible_runner version for Python 3.6.
  • Support cgroup V2 when reading system metrics in monitor.
  • Fix analyzer bug in Python 3.8 due to pandas api change.
  • Collect real-time GPU power in monitor.
  • Remove unreachable condition when writing host list in mpi mode.
  • Upgrade Docker image with cuda12.1, nccl 2.17.1-1, hpcx v2.14, and mlc 3.10.
  • Fix wrong unit of cpu-memory-bw-latency in document.

Micro-benchmark Improvements

  • Add STREAM benchmark for sustainable memory bandwidth and the corresponding computation rate.
  • Add HPL Benchmark for HPC Linpack Benchmark.
  • Support flexible warmup and non-random data initialization in cublas-benchmark.
  • Support error tolerance in micro-benchmark for CuDNN function.
  • Add distributed inference benchmark.
  • Support tensor core precisions (e.g., FP8) and batch/shape range in cublaslt gemm.

Model Benchmark Improvements

  • Fix torch.dist init issue with multiple models.
  • Support TE FP8 in BERT/GPT2 model.
  • Add num_workers configurations in model benchmark.

Release SuperBench v0.7.0

20 Jan 05:04
d76e4e1

SuperBench v0.7.0 Release Notes

SuperBench Improvements

  • Support non-zero return code when "sb deploy" or "sb run" fails in Ansible.
  • Support log flushing to the result file during runtime.
  • Update version to include revision hash and date.
  • Support "pattern" in mpi mode to run tasks in parallel.
  • Support topo-aware, all-pair, and K-batch pattern in mpi mode.
  • Pin Transformers version to avoid TensorRT failure.
  • Add CUDA11.8 Docker image for NVIDIA arch90 GPUs.
  • Support "sb deploy" without pulling image.
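
To give a feel for the all-pair and K-batch patterns listed above, here is a minimal sketch of how such schedules can be generated. It uses the standard round-robin "circle method" for all-pair and simple slicing for K-batch. This is not SuperBench's implementation; the function names are hypothetical, and keeping a smaller trailing batch is an assumption of the sketch.

```python
def all_pair_rounds(nodes):
    """Return rounds of disjoint pairs so every node meets every other node once."""
    hosts = list(nodes)
    if len(hosts) % 2:
        hosts.append(None)  # placeholder: whoever pairs with it sits the round out
    n = len(hosts)
    rounds = []
    for _ in range(n - 1):
        pairs = [(hosts[i], hosts[n - 1 - i]) for i in range(n // 2)]
        rounds.append([p for p in pairs if None not in p])
        # Keep the first host fixed and rotate the rest by one position.
        hosts = [hosts[0], hosts[-1]] + hosts[1:-1]
    return rounds


def k_batch(nodes, k):
    """Split nodes into consecutive batches of size k (last batch may be smaller)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hosts = list(nodes)
    return [hosts[i:i + k] for i in range(0, len(hosts), k)]


if __name__ == "__main__":
    for number, pairs in enumerate(all_pair_rounds(["n0", "n1", "n2", "n3"])):
        print("round", number, pairs)
    print(k_batch(["n0", "n1", "n2", "n3", "n4"], 2))
```

Pairs within one round share no node, so they can run in parallel without contending for the same host.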

Micro-benchmark Improvements

  • Support list of custom config string in cudnn-functions and cublas-functions.
  • Support correctness check in cublas-functions.
  • Support GEMM-FLOPS for NVIDIA arch90 GPUs.
  • Support cuBLASLt FP16 and FP8 GEMM.
  • Add wait time option to resolve mem-bw unstable issue.
  • Fix bug for incorrect datatype judgement in cublas-function source code.

Model Benchmark Improvements

  • Support FP8 in BERT model training.

Distributed Benchmark Improvements

  • Support pair-wise pattern in IB validation benchmark.
  • Support topo-aware, pair-wise, and K-batch pattern in nccl-bw benchmark.

Release SuperBench v0.6.0

06 Sep 06:09
09549b5

SuperBench v0.6.0 Release Notes

SuperBench Improvements

  • Support running on host directly without Docker.
  • Support running sb command inside docker image.
  • Support ROCm 5.1.1.
  • Support ROCm 5.1.3.
  • Fix bugs in data diagnosis.
  • Fix cmake and build issues.
  • Support automatic configuration yaml selection on Azure VM.
  • Refine error message when GPU is not detected.
  • Add return code for Timeout.
  • Update Dockerfile for NCCL/RCCL version, tag name, and verbose output.
  • Support node_num=1 in mpi mode.
  • Update Python setup for required packages.
  • Enhance parameter parsing to allow spaces in value.
  • Support NO_COLOR for SuperBench output.
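
NO_COLOR is a cross-tool convention: when the environment variable is present and non-empty, programs should not emit ANSI color codes. The sketch below shows that check in isolation; the helper names are hypothetical and this is not SuperBench's actual logging code.

```python
import os

RED = "\033[31m"
RESET = "\033[0m"


def colors_enabled(env=None) -> bool:
    """Return False when NO_COLOR is set to any non-empty value."""
    env = os.environ if env is None else env
    return not env.get("NO_COLOR")


def colorize(text: str, code: str, env=None) -> str:
    """Wrap text in an ANSI color code unless NO_COLOR disables it."""
    return f"{code}{text}{RESET}" if colors_enabled(env) else text


if __name__ == "__main__":
    print(colorize("benchmark failed", RED))
```

Passing the environment as a parameter keeps the check easy to unit test without mutating os.environ.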

Micro-benchmark Improvements

  • Fix issues in ib loopback benchmark.
  • Fix stability issue in ib loopback benchmark.

Distributed Benchmark Improvements

  • Enhance pair-wise IB benchmark.
  • Bug Fix in IB benchmark.
  • Support topology-aware IB benchmark.

Data Diagnosis and Analysis

  • Add failure check function in data_diagnosis.py.
  • Support JSON and JSONL in Diagnosis.
  • Add support to store values of metrics in data diagnosis.
  • Support exit code of sb result diagnosis.
  • Format int type and unify empty value to N/A in diagnosis output files.

Pre-release v0.6.0-rc1

08 Aug 09:47
9c29c93

Pre-release v0.6.0-rc1.

Release SuperBench v0.5.0

29 Apr 02:56
7f607e4
Compare
Choose a tag to compare

SuperBench 0.5.0 Release Notes

Micro-benchmark Improvements

  • Support NIC only NCCL bandwidth benchmark on single node in NCCL/RCCL bandwidth test.
  • Support bi-directional bandwidth benchmark in GPU copy bandwidth test.
  • Support data checking in GPU copy bandwidth test.
  • Update rccl-tests submodule to fix divide by zero error.
  • Add GPU-Burn micro-benchmark.

Model-benchmark Improvements

  • Sync results on root rank for e2e model benchmarks in distributed mode.
  • Support customized env in local and torch.distributed mode.
  • Add support for pytorch>=1.9.0.
  • Keep BatchNorm as fp32 for pytorch cnn models cast to fp16.
  • Remove the time spent converting samples to FP16.
  • Support FAMBench.

Inference Benchmark Improvements

  • Revise the default setting for inference benchmark.
  • Add percentile metrics for inference benchmarks.
  • Support T4 and A10 in GEMM benchmark.
  • Add configuration with inference benchmark.
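
Percentile metrics summarize tail latency better than an average. The notes do not say which percentile definition SuperBench uses, so the sketch below assumes the nearest-rank method; the function name and sample latencies are hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical per-step latencies in milliseconds.
    latencies_ms = [12.1, 11.8, 12.4, 30.2, 12.0, 11.9, 12.3, 12.2, 12.1, 12.0]
    for p in (50, 90, 99):
        print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

Other definitions interpolate between neighboring samples, so results can differ slightly on small sample sets.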

Other Improvements

  • Add command to support listing all optional parameters for benchmarks.
  • Unify benchmark naming convention and support multiple tests with same benchmark and different parameters/options in one configuration file.
  • Support timeout to detect the benchmark failure and stop the process automatically.
  • Add rocm5.0 dockerfile.
  • Improve output interface.
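
The timeout support listed above can be pictured as a wrapper that kills a hung benchmark process and reports a distinct return code. The sketch below uses the standard library only. The value 124 mirrors GNU timeout and is an assumption, not SuperBench's actual code, and the function name is hypothetical.

```python
import subprocess
import sys

TIMEOUT_RETURN_CODE = 124  # hypothetical value chosen for this sketch


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or TIMEOUT_RETURN_CODE if it exceeds timeout_s."""
    try:
        # subprocess.run kills the child and waits for it when the timeout expires.
        completed = subprocess.run(cmd, timeout=timeout_s)
        return completed.returncode
    except subprocess.TimeoutExpired:
        return TIMEOUT_RETURN_CODE


if __name__ == "__main__":
    hung = [sys.executable, "-c", "import time; time.sleep(5)"]
    print("return code:", run_with_timeout(hung, timeout_s=0.5))
```

This only terminates the direct child; a benchmark that spawns its own subprocesses would need process-group handling on top.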

Data Diagnosis and Analysis

  • Support multi-benchmark check.
  • Support result summary in md, html and excel formats.
  • Support data diagnosis in md and html formats.
  • Support result output for all nodes in data diagnosis.

Pre-release v0.5.0-rc1

25 Mar 05:58
84fed1c

Pre-release v0.5.0-rc1.

Release SuperBench v0.4.0

28 Dec 10:38
525cec7

SuperBench 0.4.0 Release Notes

SuperBench Framework

Monitor

  • Add monitor framework for NVIDIA GPU, CPU, memory and disk.

Data Diagnosis and Analysis

  • Support baseline-based data diagnosis.
  • Support basic analysis feature (boxplot figure, outlier detection, etc.).
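
Boxplots and outlier detection are closely related: the usual boxplot whiskers sit 1.5 interquartile ranges beyond the quartiles, and points outside them are flagged. The sketch below applies that rule with the standard library. It illustrates the general technique and may differ from what SuperBench's analysis code does; the function name and sample values are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first or third quartile."""
    q1, _median, q3 = quantiles(values, n=4)  # needs at least two data points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    # Hypothetical bandwidth readings from eight nodes; one node is clearly slow.
    readings = [100, 101, 99, 100, 102, 98, 100, 60]
    print(iqr_outliers(readings))
```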

Single-node Validation

Micro Benchmarks

  • CPU Memory Validation (tool: Intel Memory Latency Checker).
  • GPU Copy Bandwidth (tool: built by MSRA).
  • Add ORT Model on AMD GPU platform.
  • Add inference backend TensorRT.
  • Add inference backend ORT.

Multi-node Validation

Micro Benchmarks

  • IB Networking validation.
  • TCP validation (tool: TCPing).
  • GPCNet Validation (tool: GPCNet).

Other Improvements

  1. Enhancement

    • Add pipeline for AMD docker.
    • Integrate system config info script with SuperBench.
    • Support FP32 mode without TF32.
    • Refine unit test for microbenchmark.
    • Unify metric names for all benchmarks.
  2. Document

    • Add benchmark list.
    • Add monitor document.
    • Add data diagnosis document.