Releases: microsoft/superbenchmark

Release SuperBench v0.3.0

24 Sep 05:57
15f22e2

SuperBench v0.3.0 Release Notes

SuperBench Framework

Runner

  • Implement MPI mode.

Benchmarks

  • Support Docker benchmark.

Single-node Validation

Micro Benchmarks

  1. Memory (Tool: NVIDIA/AMD Bandwidth Test Tool)

    Metrics        | Unit | Description
    H2D_Mem_BW_GPU | GB/s | Host-to-GPU bandwidth for each GPU
    D2H_Mem_BW_GPU | GB/s | GPU-to-host bandwidth for each GPU
  2. IBLoopback (Tool: PerfTest – Standard RDMA Test Tool)

    Metrics  | Unit | Description
    IB_Write | MB/s | IB write loopback throughput with different message sizes
    IB_Read  | MB/s | IB read loopback throughput with different message sizes
    IB_Send  | MB/s | IB send loopback throughput with different message sizes
  3. NCCL/RCCL (Tool: NCCL/RCCL Tests)

    Metrics             | Unit | Description
    NCCL_AllReduce      | GB/s | NCCL AllReduce performance with different message sizes
    NCCL_AllGather      | GB/s | NCCL AllGather performance with different message sizes
    NCCL_broadcast      | GB/s | NCCL Broadcast performance with different message sizes
    NCCL_reduce         | GB/s | NCCL Reduce performance with different message sizes
    NCCL_reduce_scatter | GB/s | NCCL ReduceScatter performance with different message sizes
  4. Disk (Tool: FIO – Standard Disk Performance Tool)

    Metrics        | Unit | Description
    Seq_Read       | MB/s | Sequential read performance
    Seq_Write      | MB/s | Sequential write performance
    Rand_Read      | MB/s | Random read performance
    Rand_Write     | MB/s | Random write performance
    Seq_R/W_Read   | MB/s | Read performance in mixed sequential read/write at a fixed ratio (read:write = 4:1)
    Seq_R/W_Write  | MB/s | Write performance in mixed sequential read/write (read:write = 4:1)
    Rand_R/W_Read  | MB/s | Read performance in mixed random read/write (read:write = 4:1)
    Rand_R/W_Write | MB/s | Write performance in mixed random read/write (read:write = 4:1)
  5. H2D/D2H SM Transmission Bandwidth (Tool: MSR-A build)

    Metrics       | Unit | Description
    H2D_SM_BW_GPU | GB/s | Host-to-GPU bandwidth using GPU kernel for each GPU
    D2H_SM_BW_GPU | GB/s | GPU-to-host bandwidth using GPU kernel for each GPU
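
The mixed read/write Disk metrics above use a read:write ratio of 4:1. As a rough illustration of what that ratio means, the sketch below converts it into a read percentage, which is the form FIO's `rwmixread` option expects. Whether SuperBench passes the ratio to FIO in exactly this way is an assumption, and the helper name `rwmix_read_percent` is hypothetical.

```python
def rwmix_read_percent(read_parts: int, write_parts: int) -> int:
    """Return the read share of a mixed workload as a whole percentage."""
    total = read_parts + write_parts
    if read_parts < 0 or write_parts < 0 or total == 0:
        raise ValueError("ratio parts must be non-negative and not both zero")
    return round(100 * read_parts / total)


# read:write = 4:1 means 4 out of every 5 operations are reads.
read_pct = rwmix_read_percent(4, 1)
print(f"rwmixread={read_pct}")
```

With a 4:1 ratio, reads make up 80% of the I/O and writes the remaining 20%, so the read and write metrics of a mixed run are not expected to be equal.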

AMD GPU Support

Docker Image Support

  • ROCm 4.2 PyTorch 1.7.0
  • ROCm 4.0 PyTorch 1.7.0

Micro Benchmarks

  1. Kernel Launch (Tool: MSR-A build)

    Metrics                  | Unit      | Description
    Kernel_Launch_Event_Time | Time (ms) | Dispatch latency measured in GPU time using hipEventRecord()
    Kernel_Launch_Wall_Time  | Time (ms) | Dispatch latency measured in CPU time
  2. GEMM FLOPS (Tool: AMD rocblas-bench Tool)

    Metrics  | Unit   | Description
    FP64     | GFLOPS | FP64 FLOPS without MatrixCore
    FP32(MC) | GFLOPS | FP32 FLOPS with MatrixCore
    FP16(MC) | GFLOPS | FP16 FLOPS with MatrixCore
    BF16(MC) | GFLOPS | BF16 FLOPS with MatrixCore
    INT8(MC) | GOPS   | INT8 OPS with MatrixCore
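
For readers unfamiliar with the GFLOPS unit in the GEMM metrics above, here is a minimal sketch of the conventional calculation: an m x k by k x n matrix multiplication is counted as 2·m·n·k floating-point operations (one multiply and one add per inner-product term). The function name `gemm_gflops` is hypothetical, and the exact operation count the benchmark tool reports may differ.

```python
def gemm_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput of one (m x k) @ (k x n) GEMM in GFLOPS.

    Uses the conventional count of 2*m*n*k operations.
    """
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    operations = 2 * m * n * k
    return operations / seconds / 1e9


# A 1000 x 1000 x 1000 GEMM finishing in one second is 2 GFLOPS.
print(gemm_gflops(1000, 1000, 1000, 1.0))
```

The same formula gives GOPS for integer types such as INT8, where the operations counted are not floating-point.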

E2E Benchmarks

  1. CNN models -- Use PyTorch torchvision models

    • ResNet: ResNet-50, ResNet-101, ResNet-152
    • DenseNet: DenseNet-169, DenseNet-201
    • VGG: VGG-11, VGG-13, VGG-16, VGG-19
  2. BERT -- Use Hugging Face Transformers

    • BERT
    • BERT Large
  3. LSTM -- Use PyTorch

  4. GPT-2 -- Use Hugging Face Transformers

Bug Fix

  • Fix VGG models failing on A100 GPU with batch_size=128.

Other Improvements

  1. Contribution related

    • Contribution rules
    • System information collection
  2. Document

    • Add release process doc
    • Add design documents
    • Add developer guide doc for coding style
    • Add contribution rules
    • Add docker image list
    • Add initial validation results

Release SuperBench v0.2.1

29 Jul 07:59
ad868c7

SuperBench v0.2.1 Release Notes

Bug Fixes

  • Fix Ansible connection issue when running on localhost.
  • Fix crashes in distributed training of VGG models.
  • Fix bug when converting a bool config value to a store_true argument.
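
To illustrate the kind of conversion the last fix refers to, here is a minimal sketch of turning a config mapping into command-line arguments, where a boolean value maps to an argparse `store_true` flag. This is not the actual SuperBench code; the function `config_to_args` and the parameter names `batch_size` and `no_gpu` are hypothetical.

```python
import argparse


def config_to_args(params):
    """Convert a config mapping into a list of command-line arguments."""
    args = []
    for key, value in params.items():
        flag = f"--{key}"
        if isinstance(value, bool):
            # A store_true flag takes no value: emit it only when True.
            if value:
                args.append(flag)
        else:
            args.extend([flag, str(value)])
    return args


# Hypothetical benchmark parser with one store_true option.
parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--no_gpu", action="store_true")

argv = config_to_args({"batch_size": 128, "no_gpu": True})
parsed = parser.parse_args(argv)
print(argv, parsed.no_gpu)
```

A naive conversion that emits `--no_gpu True` would fail here, because a `store_true` option accepts no value and argparse would reject the stray `True` as an unrecognized argument.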

Release SuperBench v0.2.0

05 Jul 12:29
27b757e

SuperBench v0.2.0 Release Notes

SuperBench Framework

  • Implemented a command-line interface (CLI).
  • Implemented Runner for node control and management.
  • Implemented Executor.
  • Implemented Benchmark framework.

Supported Benchmarks

  • Supported Micro-benchmarks
    • GEMM FLOPS (GFLOPS, TensorCore, cuBLAS, cuDNN)
    • Kernel Launch Time (Kernel_Launch_Event_Time, Kernel_Launch_Wall_Time)
    • Operator Performance (MatMul, Sharding_MatMul)
  • Supported Model-benchmarks
    • CNN models
      (Reference: torchvision models)
      • ResNet (ResNet-18, ResNet-34, ResNet-50, ResNet-101, ResNet-152)
      • DenseNet (DenseNet-161, DenseNet-169, DenseNet-201)
      • VGG (VGG-11, VGG-13, VGG-16, VGG-19, VGG11_bn, VGG13_bn, VGG16_bn, VGG19_bn)
      • MNASNet (mnasnet0_5, mnasnet0_75, mnasnet1_0, mnasnet1_3)
      • AlexNet
      • GoogLeNet
      • Inception_v3
      • mobilenet_v2
      • ResNeXt (resnext50_32x4d, resnext101_32x8d)
      • Wide ResNet (wide_resnet50_2, wide_resnet101_2)
      • ShuffleNet (shufflenet_v2_x0_5, shufflenet_v2_x1_0, shufflenet_v2_x1_5, shufflenet_v2_x2_0)
      • SqueezeNet (squeezenet1_0, squeezenet1_1)
    • LSTM model
    • BERT models (BERT-Base, BERT-Large)
    • GPT-2 model (specify which config)

Examples and Documents

  • Added examples for running each benchmark.
  • Added tutorial documents (introduction, getting-started, developer-guides, APIs, benchmarks).
  • Built SuperBench website.