Releases · microsoft/superbenchmark
Release SuperBench v0.11.0
SuperBench 0.11.0 Release Notes
SuperBench Improvements
- Add CUDA 12.4 dockerfile.
- Upgrade NCCL version to v2.23.4 and install UCX v1.16.0 in the CUDA 12.4 dockerfile.
- Fix MSCCL build error in CUDA12.4 docker build pipeline.
- Add ROCm6.2 dockerfile.
- Update hpcx link in cuda11.1 dockerfile to fix docker build failure.
- Improve document (Fix metrics name and typos in user tutorial, add BibTeX in README and repo).
- Limit protobuf version to 3.20.x to fix onnxruntime dependency error (a version-check sketch follows this list).
- Update omegaconf version to 2.3.0 and fix issues caused by the omegaconf version update.
- Update Docker Exec Command for Persistent HPCX Environment.
- Fix cuda 12.2 dockerfile LD_LIBRARY_PATH issue.
- Use types-setuptools to replace types-pkg_resources.
- Add configuration for NDv5 H100 and AMD MI300x.
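As a rough illustration of the dependency pins above (protobuf 3.20.x, omegaconf 2.3.0), a minimal Python check of the installed versions might look like the following; it is not part of SuperBench itself:

```python
# Hypothetical sanity check for the dependency pins mentioned above.
from importlib.metadata import version

def check_pins():
    pb = version("protobuf")
    oc = version("omegaconf")
    assert pb.startswith("3.20."), f"protobuf {pb} is outside the 3.20.x pin"
    assert oc == "2.3.0", f"omegaconf {oc} does not match the 2.3.0 pin"
    print(f"protobuf={pb}, omegaconf={oc}: OK")

if __name__ == "__main__":
    check_pins()
```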
Micro-benchmark Improvements
- Add hipblasLt tuning to dist-inference cpp implementation.
- Add support for NVIDIA L4/L40/L40s GPUs in gemm-flops.
- Upgrade mlc to v3.11.
Model-benchmark Improvements
- Support FP8 transformer model training in ROCm6.2 dockerfile.
Result Analysis
- Fix failure-test bug and pandas warning in data diagnosis.
Release SuperBench v0.10.0
SuperBench 0.10.0 Release Notes
SuperBench Improvements
- Support monitoring for AMD GPUs.
- Support ROCm 5.7 and ROCm 6.0 dockerfiles.
- Add MSCCL support for Nvidia GPU.
- Fix NUMA domains swap issue in NDv4 topology file.
- Add NDv5 topo file.
- Pin NCCL and NCCL-test to 2.18.3 to fix hang issue in CUDA 12.2.
Micro-benchmark Improvements
- Add HPL random generator to gemm-flops with ROCm.
- Add DirectXGPURenderFPS benchmark to measure the FPS of rendering simple frames.
- Add HWDecoderFPS benchmark to measure the FPS of hardware decoder performance.
- Update Docker image for H100 support.
- Update MLC version to 3.10 in CUDA/ROCm dockerfiles.
- Bug fix for GPU Burn test.
- Support INT8 in cublaslt function.
- Add hipBLASLt function benchmark.
- Support cpu-gpu and gpu-cpu in ib-validation.
- Support graph mode in NCCL/RCCL benchmarks for latency metrics.
- Support cpp implementation in distributed inference benchmark.
- Add O2 option for gpu copy ROCm build.
- Support different hipblasLt data types in dist inference.
- Support in-place in NCCL/RCCL benchmark.
- Support data type option in NCCL/RCCL benchmark.
- Improve P2P performance with fine-grained GPU memory in GPU-copy test for AMD GPUs.
- Update hipblaslt GEMM metric unit to tflops.
- Support FP8 for hipblaslt benchmark.
Model Benchmark Improvements
- Change torch.distributed.launch to torchrun (a minimal worker sketch follows this list).
- Support Megatron-LM/Megatron-Deepspeed GPT pretrain benchmark.
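For the torch.distributed.launch-to-torchrun change above, the key worker-side difference is that torchrun passes the local rank through environment variables instead of a --local_rank argument. A minimal sketch, not SuperBench's actual training code:

```python
# Worker-side initialization after migrating to torchrun: LOCAL_RANK, RANK and
# WORLD_SIZE are provided as environment variables by the launcher.
import os
import torch
import torch.distributed as dist

def init_distributed():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")     # rank/world size read from env
    return local_rank

# Launch with, e.g.: torchrun --nproc_per_node=8 train.py  (script name illustrative)
```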
Result Analysis
- Support baseline generation from multiple nodes.
Release SuperBench v0.9.0
SuperBench 0.9.0 Release Notes
SuperBench Improvements
- Support Ctrl+C and interrupt to stop all SuperBench testing.
- Support Windows Docker for VDI/Gaming GPU.
- Support DirectX platform for Nvidia and AMD GPU.
- Add System Config Info feature in SB runner to support distributed collection.
- Support DirectX test pipeline.
Micro-benchmark Improvements
- Add DirectXGPUCopyBw Benchmark to measure HtoD/DtoH bandwidth by DirectX.
- Add DirectXGPUCoreFLops Benchmark to measure peak FLOPS by DirectX.
- Add DirectXGPUMemBw Benchmark to measure GPU memory bandwidth by DirectX.
- Add DirectXVCNEncodingLatency Benchmark to measure the VCN hardware encoding latency on AMD graphics GPUs.
- Support best algorithm selection in cudnn-function microbenchmark.
- Revise step time collection in distributed inference benchmark.
Model Benchmark Improvements
- Fix early stop logic due to num_steps in model benchmarks.
- Support TensorRT models on Nvidia H100.
Documentation Improvements
- Improve documentation for System Config Info.
- Update outdated references.
Release SuperBench v0.8.0
SuperBench 0.8.0 Release Notes
SuperBench Improvements
- Support SuperBench Executor running on Windows.
- Remove fixed rccl version in rocm5.1.x dockerfile.
- Upgrade networkx version to fix installation compatibility issue.
- Pin setuptools version to v65.7.0.
- Limit ansible_runner version for Python 3.6.
- Support cgroup V2 when reading system metrics in the monitor (see the sketch after this list).
- Fix analyzer bug in Python 3.8 due to pandas api change.
- Collect real-time GPU power in monitor.
- Remove unreachable condition when writing host list in mpi mode.
- Upgrade Docker image with cuda12.1, nccl 2.17.1-1, hpcx v2.14, and mlc 3.10.
- Fix wrong unit of cpu-memory-bw-latency in document.
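A rough sketch of the cgroup v1/v2 distinction the monitor change above has to handle, using commonly used file paths; the actual metrics SuperBench collects may differ:

```python
# Read container memory usage under cgroup v2 (unified hierarchy) with a
# fallback to the cgroup v1 file; paths are the standard kernel locations.
from pathlib import Path

CGROUP_V2_MEM = Path("/sys/fs/cgroup/memory.current")
CGROUP_V1_MEM = Path("/sys/fs/cgroup/memory/memory.usage_in_bytes")

def read_memory_usage_bytes():
    path = CGROUP_V2_MEM if CGROUP_V2_MEM.exists() else CGROUP_V1_MEM
    return int(path.read_text().strip())
```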
Micro-benchmark Improvements
- Add STREAM benchmark for sustainable memory bandwidth and the corresponding computation rate.
- Add HPL (High Performance Linpack) benchmark.
- Support flexible warmup and non-random data initialization in cublas-benchmark.
- Support error tolerance in micro-benchmark for CuDNN function.
- Add distributed inference benchmark.
- Support tensor core precisions (e.g., FP8) and batch/shape range in cublaslt gemm (a range-expansion sketch follows this list).
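As an illustration of what a batch/shape range expands to, a small helper that enumerates power-of-two GEMM cases; the benchmark's real option syntax and expansion rules may differ:

```python
# Illustrative-only expansion of batch/shape ranges into concrete GEMM cases.
from itertools import product

def power_range(start, stop):
    """Yield start, 2*start, 4*start, ... up to stop (inclusive)."""
    value = start
    while value <= stop:
        yield value
        value *= 2

def gemm_cases(batches, m_range, n_range, k_range):
    for b, m, n, k in product(batches, power_range(*m_range),
                              power_range(*n_range), power_range(*k_range)):
        yield {"batch": b, "m": m, "n": n, "k": k}

# Example: all (m, n, k) from 1024 to 8192 in powers of two, batch sizes 1 and 8.
cases = list(gemm_cases([1, 8], (1024, 8192), (1024, 8192), (1024, 8192)))
```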
Model Benchmark Improvements
- Fix torch.dist init issue with multiple models.
- Support TE FP8 in BERT/GPT2 model.
- Add num_workers configuration in model benchmark (see the DataLoader sketch after this list).
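The num_workers knob above maps to the PyTorch DataLoader worker-process count; a minimal usage sketch (how the parameter is plumbed through SuperBench configs may differ):

```python
# DataLoader with a configurable number of worker processes.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 3, 224, 224),
                        torch.zeros(1024, dtype=torch.long))
loader = DataLoader(dataset, batch_size=32, shuffle=True,
                    num_workers=4, pin_memory=True)
```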
Release SuperBench v0.7.0
SuperBench v0.7.0 Release Notes
SuperBench Improvements
- Support non-zero return code when "sb deploy" or "sb run" fails in Ansible.
- Support log flushing to the result file during runtime.
- Update version to include revision hash and date.
- Support "pattern" in mpi mode to run tasks in parallel.
- Support topo-aware, all-pair, and K-batch pattern in mpi mode.
- Pin Transformers version to avoid TensorRT failure.
- Add CUDA11.8 Docker image for NVIDIA arch90 GPUs.
- Support "sb deploy" without pulling image.
Micro-benchmark Improvements
- Support list of custom config string in cudnn-functions and cublas-functions.
- Support correctness check in cublas-functions (a correctness-check sketch follows this list).
- Support GEMM-FLOPS for NVIDIA arch90 GPUs.
- Support cuBLASLt FP16 and FP8 GEMM.
- Add wait time option to resolve mem-bw unstable issue.
- Fix incorrect data type detection in cublas-function source code.
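A hedged sketch of what a GEMM correctness check can look like, comparing against a higher-precision reference within a tolerance; the actual cublas-functions check is implemented in C++/CUDA, not Python:

```python
# Compare a GEMM result against a float64 reference with relative/absolute tolerances.
import numpy as np

def check_gemm(a, b, result, rtol=1e-3, atol=1e-3):
    reference = a.astype(np.float64) @ b.astype(np.float64)
    return np.allclose(result.astype(np.float64), reference, rtol=rtol, atol=atol)

a = np.random.rand(256, 128).astype(np.float32)
b = np.random.rand(128, 64).astype(np.float32)
print(check_gemm(a, b, a @ b))  # True when the computed result matches the reference
```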
Model Benchmark Improvements
- Support FP8 in BERT model training.
Distributed Benchmark Improvements
- Support pair-wise pattern in IB validation benchmark (a pair-generation sketch follows this list).
- Support topo-aware, pair-wise, and K-batch pattern in nccl-bw benchmark.
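One common way to build a pair-wise schedule is the round-robin "circle method", where every node pair is exercised exactly once and all nodes stay busy in each round. This is only a sketch of the idea, not SuperBench's implementation:

```python
# Round-robin (circle method) pairing: n-1 rounds, each node paired once per round,
# every pair of nodes appearing exactly once overall.
def pairwise_rounds(nodes):
    n = len(nodes)
    if n % 2:
        nodes = nodes + [None]          # bye slot for odd node counts
        n += 1
    rounds = []
    rotation = nodes[1:]
    for _ in range(n - 1):
        ring = [nodes[0]] + rotation
        pairs = [(ring[i], ring[n - 1 - i]) for i in range(n // 2)
                 if ring[i] is not None and ring[n - 1 - i] is not None]
        rounds.append(pairs)
        rotation = rotation[-1:] + rotation[:-1]   # rotate all but the fixed node
    return rounds

print(pairwise_rounds(["node0", "node1", "node2", "node3"]))
```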
Release SuperBench v0.6.0
SuperBench v0.6.0 Release Notes
SuperBench Improvements
- Support running on host directly without Docker.
- Support running "sb" command inside docker image.
- Support ROCm 5.1.1.
- Support ROCm 5.1.3.
- Fix bugs in data diagnosis.
- Fix cmake and build issues.
- Support automatic configuration yaml selection on Azure VM.
- Refine error message when GPU is not detected.
- Add return code for Timeout.
- Update Dockerfile for NCCL/RCCL version, tag name, and verbose output.
- Support node_num=1 in mpi mode.
- Update Python setup for required packages.
- Enhance parameter parsing to allow spaces in value.
- Support NO_COLOR for SuperBench output (see the sketch after this list).
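A minimal sketch of honoring the NO_COLOR convention (https://no-color.org/) when emitting colored output; the real output code may differ:

```python
# Skip ANSI color codes when NO_COLOR is set or stdout is not a terminal.
import os
import sys

def colorize(text, ansi_code="\033[32m"):
    if os.environ.get("NO_COLOR") or not sys.stdout.isatty():
        return text
    return f"{ansi_code}{text}\033[0m"

print(colorize("PASS"))
```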
Micro-benchmark Improvements
- Fix issues in ib loopback benchmark.
- Fix stability issue in ib loopback benchmark.
Distributed Benchmark Improvements
- Enhance pair-wise IB benchmark.
- Bug Fix in IB benchmark.
- Support topology-aware IB benchmark.
Data Diagnosis and Analysis
- Add failure check function in data_diagnosis.py.
- Support JSON and JSONL in diagnosis (a loading sketch follows this list).
- Add support to store values of metrics in data diagnosis.
- Support exit code of sb result diagnosis.
- Format int type and unify empty value to N/A in diagnosis output files.
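A small sketch of accepting either JSON or JSONL result files, as the diagnosis support above implies; the file layout and field names are assumptions:

```python
# Load diagnosis input from a .json file (one document) or a .jsonl file
# (one JSON object per line) into a flat list of records.
import json
from pathlib import Path

def load_results(path):
    text = Path(path).read_text().strip()
    if str(path).endswith(".jsonl"):
        return [json.loads(line) for line in text.splitlines() if line]
    data = json.loads(text)
    return data if isinstance(data, list) else [data]
```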
Pre-release v0.6.0-rc1
Release SuperBench v0.5.0
SuperBench 0.5.0 Release Notes
Micro-benchmark Improvements
- Support NIC-only NCCL bandwidth benchmark on a single node in NCCL/RCCL bandwidth test.
- Support bi-directional bandwidth benchmark in GPU copy bandwidth test.
- Support data checking in GPU copy bandwidth test.
- Update rccl-tests submodule to fix divide by zero error.
- Add GPU-Burn micro-benchmark.
Model-benchmark Improvements
- Sync results on root rank for e2e model benchmarks in distributed mode.
- Support customized "env" in local and torch.distributed mode.
- Add support for pytorch>=1.9.0.
- Keep BatchNorm as fp32 for pytorch cnn models cast to fp16 (see the sketch after this list).
- Remove FP16 sample type conversion time.
- Support FAMBench.
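The BatchNorm-in-FP32 detail above follows a common mixed-precision recipe: cast the model to half precision but keep normalization layers in FP32 for numerical stability. A minimal PyTorch sketch, not SuperBench's exact code:

```python
# Cast a model to FP16 while keeping BatchNorm parameters and running stats in FP32.
import torch.nn as nn

def cast_fp16_keep_bn_fp32(model: nn.Module) -> nn.Module:
    model.half()
    for module in model.modules():
        if isinstance(module, nn.modules.batchnorm._BatchNorm):
            module.float()   # BatchNorm stays in FP32
    return model
```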
Inference Benchmark Improvements
- Revise the default setting for inference benchmark.
- Add percentile metrics for inference benchmarks (a computation sketch follows this list).
- Support T4 and A10 in GEMM benchmark.
- Add configuration for inference benchmark.
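A sketch of the kind of percentile metrics added for inference benchmarks, computed from per-step latencies; the metric names here are illustrative:

```python
# Tail-latency percentiles from a list of per-step latencies in milliseconds.
import numpy as np

def percentile_metrics(latencies_ms):
    arr = np.asarray(latencies_ms, dtype=np.float64)
    return {
        "latency_p50_ms": float(np.percentile(arr, 50)),
        "latency_p90_ms": float(np.percentile(arr, 90)),
        "latency_p99_ms": float(np.percentile(arr, 99)),
    }

print(percentile_metrics([1.2, 1.3, 1.1, 5.4, 1.2, 1.3]))
```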
Other Improvements
- Add command to support listing all optional parameters for benchmarks.
- Unify benchmark naming convention and support multiple tests with same benchmark and different parameters/options in one configuration file.
- Support timeout to detect the benchmark failure and stop the process automatically.
- Add rocm5.0 dockerfile.
- Improve output interface.
Data Diagnosis and Analysis
- Support multi-benchmark check.
- Support result summary in md, html and excel formats (see the sketch after this list).
- Support data diagnosis in md and html formats.
- Support result output for all nodes in data diagnosis.
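A minimal sketch of writing a summary table in the three formats above with pandas writers; the actual summary layout is defined by SuperBench, and the data here is placeholder:

```python
# Write one summary DataFrame to markdown, HTML and Excel.
import pandas as pd

summary = pd.DataFrame({
    "node": ["node0", "node1"],            # placeholder rows for illustration
    "metric_a": [1.0, 0.98],
    "metric_b": [2.0, 1.97],
})
summary.to_markdown("summary.md", index=False)   # requires the 'tabulate' package
summary.to_html("summary.html", index=False)
summary.to_excel("summary.xlsx", index=False)    # requires 'openpyxl'
```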
Pre-release v0.5.0-rc1
Release SuperBench v0.4.0
SuperBench 0.4.0 Release Notes
SuperBench Framework
Monitor
- Add monitor framework for NVIDIA GPU, CPU, memory and disk.
Data Diagnosis and Analysis
- Support baseline-based data diagnosis.
- Support basic analysis features (boxplot figure, outlier detection, etc.); an outlier-detection sketch follows this list.
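A sketch of boxplot-style outlier detection using the common Tukey fences; the thresholds SuperBench actually uses may differ:

```python
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], the convention behind boxplot whiskers.
import numpy as np

def iqr_outliers(values, k=1.5):
    arr = np.asarray(values, dtype=np.float64)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return arr[(arr < lower) | (arr > upper)]

print(iqr_outliers([9.8, 10.1, 10.0, 9.9, 10.2, 4.2]))  # -> [4.2]
```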
Single-node Validation
Micro Benchmarks
- CPU Memory Validation (tool: Intel Memory Latency Checker).
- GPU Copy Bandwidth (tool: built by MSRA).
- Add ORT Model on AMD GPU platform.
- Add inference backend TensorRT.
- Add inference backend ORT.
Multi-node Validation
Micro Benchmarks
- IB Networking validation.
- TCP validation (tool: TCPing).
- GPCNet Validation (tool: GPCNet).
Other Improvements
Enhancement
- Add pipeline for AMD docker.
- Integrate system config info script with SuperBench.
- Support FP32 mode without TF32.
- Refine unit test for microbenchmark.
- Unify metric names for all benchmarks.
Document
- Add benchmark list.
- Add monitor document.
- Add data diagnosis document.