Releases: microsoft/superbenchmark
Release SuperBench v0.3.0
SuperBench v0.3.0 Release Notes
SuperBench Framework
Runner
- Implement MPI mode.
Benchmarks
- Support Docker benchmark.
Single-node Validation
Micro Benchmarks
- Memory (Tool: NVIDIA/AMD Bandwidth Test Tool)

  | Metrics | Unit | Description |
  |---|---|---|
  | H2D_Mem_BW_GPU | GB/s | host-to-GPU bandwidth for each GPU |
  | D2H_Mem_BW_GPU | GB/s | GPU-to-host bandwidth for each GPU |

- IBLoopback (Tool: PerfTest – Standard RDMA Test Tool)

  | Metrics | Unit | Description |
  |---|---|---|
  | IB_Write | MB/s | The IB write loopback throughput with different message sizes |
  | IB_Read | MB/s | The IB read loopback throughput with different message sizes |
  | IB_Send | MB/s | The IB send loopback throughput with different message sizes |

- NCCL/RCCL (Tool: NCCL/RCCL Tests)

  | Metrics | Unit | Description |
  |---|---|---|
  | NCCL_AllReduce | GB/s | The NCCL AllReduce performance with different message sizes |
  | NCCL_AllGather | GB/s | The NCCL AllGather performance with different message sizes |
  | NCCL_broadcast | GB/s | The NCCL Broadcast performance with different message sizes |
  | NCCL_reduce | GB/s | The NCCL Reduce performance with different message sizes |
  | NCCL_reduce_scatter | GB/s | The NCCL ReduceScatter performance with different message sizes |

- Disk (Tool: FIO – Standard Disk Performance Tool)

  | Metrics | Unit | Description |
  |---|---|---|
  | Seq_Read | MB/s | Sequential read performance |
  | Seq_Write | MB/s | Sequential write performance |
  | Rand_Read | MB/s | Random read performance |
  | Rand_Write | MB/s | Random write performance |
  | Seq_R/W_Read | MB/s | Read performance in sequential read/write, fixed measurement (read:write = 4:1) |
  | Seq_R/W_Write | MB/s | Write performance in sequential read/write (read:write = 4:1) |
  | Rand_R/W_Read | MB/s | Read performance in random read/write (read:write = 4:1) |
  | Rand_R/W_Write | MB/s | Write performance in random read/write (read:write = 4:1) |

- H2D/D2H SM Transmission Bandwidth (Tool: MSR-A build)

  | Metrics | Unit | Description |
  |---|---|---|
  | H2D_SM_BW_GPU | GB/s | host-to-GPU bandwidth using GPU kernel for each GPU |
  | D2H_SM_BW_GPU | GB/s | GPU-to-host bandwidth using GPU kernel for each GPU |
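The Disk metrics above use a read:write ratio of 4:1 for the mixed workloads. As a rough illustration of what that ratio means in FIO terms, the sketch below builds an fio argument list where 4:1 becomes 80% reads. The helper name `fio_mixed_rw_args`, the job name, and the file path are hypothetical, and this is not taken from SuperBench's own FIO configuration; it only uses the standard fio options `--rw` and `--rwmixread`.

```python
import shlex


def fio_mixed_rw_args(filename, read_parts=4, write_parts=1, sequential=True):
    """Build an fio argument list for a mixed read/write job (hypothetical helper)."""
    # A read:write ratio of 4:1 means 4 / (4 + 1) = 80% of the I/O is reads.
    read_percent = round(100 * read_parts / (read_parts + write_parts))
    return [
        "fio",
        "--name=mixed_rw",  # job name, chosen for illustration
        f"--filename={filename}",
        # fio's "rw" mode is sequential mixed I/O; "randrw" is random mixed I/O.
        f"--rw={'rw' if sequential else 'randrw'}",
        f"--rwmixread={read_percent}",
    ]


# Print the command line instead of running it; the path is a placeholder.
print(shlex.join(fio_mixed_rw_args("/tmp/fio_test.bin")))
```

A real job would also need options such as file size and block size, which the release notes do not specify.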
AMD GPU Support
Docker Image Support
- ROCm 4.2 PyTorch 1.7.0
- ROCm 4.0 PyTorch 1.7.0
Micro Benchmarks
- Kernel Launch (Tool: MSR-A build)

  | Metrics | Unit | Description |
  |---|---|---|
  | Kernel_Launch_Event_Time | Time (ms) | Dispatch latency measured in GPU time using hipEventRecord() |
  | Kernel_Launch_Wall_Time | Time (ms) | Dispatch latency measured in CPU time |

- GEMM FLOPS (Tool: AMD rocblas-bench Tool)

  | Metrics | Unit | Description |
  |---|---|---|
  | FP64 | GFLOPS | FP64 FLOPS without MatrixCore |
  | FP32(MC) | GFLOPS | FP32 FLOPS with MatrixCore |
  | FP16(MC) | GFLOPS | FP16 FLOPS with MatrixCore |
  | BF16(MC) | GFLOPS | BF16 FLOPS with MatrixCore |
  | INT8(MC) | GOPS | INT8 OPS with MatrixCore |
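For readers unfamiliar with the GFLOPS unit in the GEMM table above, the sketch below shows the conventional way a GEMM throughput figure is derived from matrix dimensions and elapsed time. The function name `gemm_gflops` and the sample sizes are hypothetical. It assumes the usual count of 2·m·n·k operations per matrix multiply, and the release notes do not say how the benchmark tool computes its numbers.

```python
def gemm_gflops(m, n, k, seconds):
    """Throughput of one (m x k) @ (k x n) matrix multiply, in GFLOPS."""
    # Each of the m*n output elements needs k multiplies and k adds,
    # giving 2*m*n*k floating-point operations in total.
    ops = 2 * m * n * k
    return ops / seconds / 1e9


# Example with made-up numbers: a 4096^3 GEMM that took 10 ms.
print(f"{gemm_gflops(4096, 4096, 4096, 0.010):.1f} GFLOPS")
```

The same formula gives GOPS for integer types such as INT8, where the operations counted are integer multiply-adds rather than floating-point ones.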
E2E Benchmarks
- CNN models -- Use PyTorch torchvision models
  - ResNet: ResNet-50, ResNet-101, ResNet-152
  - DenseNet: DenseNet-169, DenseNet-201
  - VGG: VGG-11, VGG-13, VGG-16, VGG-19
- BERT -- Use huggingface Transformers
  - BERT
  - BERT Large
- LSTM -- Use PyTorch
- GPT-2 -- Use huggingface Transformers
Bug Fix
- VGG models failed on A100 GPU with batch_size=128
Other Improvement
- Contribution related
  - Contribution rules
  - System information collection
- Document
  - Add release process doc
  - Add design documents
  - Add developer guide doc for coding style
  - Add contribution rules
  - Add docker image list
  - Add initial validation results
Release SuperBench v0.2.1
SuperBench v0.2.1 Release Notes
Bug Fixes
- Fix Ansible connection issue when running on localhost.
- Fix crashes in distributed training of VGG models.
- Fix bug when converting a bool config value to a store_true argument.
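The release notes do not describe the store_true bug in detail. As a general illustration of the pitfall, an `argparse` flag declared with `action="store_true"` takes no value, so a boolean config entry has to be emitted as a bare flag when true and omitted when false. The sketch below shows one way to do that; `config_to_args`, `--no_gpu`, and `--batch_size` are hypothetical names, not SuperBench's actual code or options.

```python
import argparse


def config_to_args(params):
    """Turn a config dict into a CLI argument list (hypothetical helper)."""
    args = []
    for key, value in params.items():
        if isinstance(value, bool):
            # store_true flags take no value: emit the bare flag only when True.
            if value:
                args.append(f"--{key}")
        else:
            args.extend([f"--{key}", str(value)])
    return args


# A toy parser with one store_true flag and one valued option.
parser = argparse.ArgumentParser()
parser.add_argument("--no_gpu", action="store_true")
parser.add_argument("--batch_size", type=int, default=32)

ns = parser.parse_args(config_to_args({"no_gpu": True, "batch_size": 128}))
print(ns.no_gpu, ns.batch_size)
```

Passing `--no_gpu True` instead would make `argparse` reject `True` as an unrecognized argument, which is the kind of failure a naive key/value conversion produces.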
Release SuperBench v0.2.0
SuperBench v0.2.0 Release Notes
SuperBench Framework
- Implemented a command line interface (CLI).
- Implemented Runner for nodes control and management.
- Implemented Executor.
- Implemented Benchmark framework.
Supported Benchmarks
- Supported Micro-benchmarks
  - GEMM FLOPS (GFLOPS, TensorCore, cuBLAS, cuDNN)
  - Kernel Launch Time (Kernel_Launch_Event_Time, Kernel_Launch_Wall_Time)
  - Operator Performance (MatMul, Sharding_MatMul)
- Supported Model-benchmarks
  - CNN models (Reference: torchvision models)
    - ResNet (ResNet-18, ResNet-34, ResNet-50, ResNet-101, ResNet-152)
    - DenseNet (DenseNet-161, DenseNet-169, DenseNet-201)
    - VGG (VGG-11, VGG-13, VGG-16, VGG-19, VGG11_bn, VGG13_bn, VGG16_bn, VGG19_bn)
    - MNASNet (mnasnet0_5, mnasnet0_75, mnasnet1_0, mnasnet1_3)
    - AlexNet
    - GoogLeNet
    - Inception_v3
    - mobilenet_v2
    - ResNeXt (resnext50_32x4d, resnext101_32x8d)
    - Wide ResNet (wide_resnet50_2, wide_resnet101_2)
    - ShuffleNet (shufflenet_v2_x0_5, shufflenet_v2_x1_0, shufflenet_v2_x1_5, shufflenet_v2_x2_0)
    - SqueezeNet (squeezenet1_0, squeezenet1_1)
  - LSTM model
  - BERT models (BERT-Base, BERT-Large)
  - GPT-2 model (specify which config)
Examples and Documents
- Added examples showing how to run each benchmark.
- Tutorial Documents (introduction, getting-started, developer-guides, APIs, benchmarks).
- Built SuperBench website.