Introduction

This repository showcases a series of GEMM (general matrix multiplication) implementations, each applying a different technique to improve performance. GEMM computes:

C = alpha * A * B + beta * C
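As a point of reference for the formula above, here is a minimal CPU sketch of GEMM in C++. It is not taken from this repository: the function name `gemm_reference`, the row-major layout, and the shapes (A is m x k, B is k x n, C is m x n) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference GEMM: C = alpha * A * B + beta * C.
// Assumed layout (not stated in this README): row-major storage,
// A is m x k, B is k x n, C is m x n.
void gemm_reference(std::size_t m, std::size_t n, std::size_t k,
                    float alpha, const std::vector<float>& A,
                    const std::vector<float>& B,
                    float beta, std::vector<float>& C) {
    assert(A.size() == m * k);
    assert(B.size() == k * n);
    assert(C.size() == m * n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            // Dot product of row i of A with column j of B.
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```

A triple loop like this is useful mainly as a correctness check against which faster versions can be compared.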

Matrix Multiplication Algorithm Implementations

Installation

  • Edit build.sh and set the CMake options for your system:
    • cmake -DCUDA_ARCH=/your/cuda/arch -DCUDA_TOOLKIT_ROOT_DIR=/local/cuda/path
  • Run bash build.sh

Performance

Measured on an RTX 4070 Ti. Theoretical FP32 (float) peak performance: 40.09 TFLOPS.

Benchmark                               Time      CPU       Iterations  UserCounters
Naive/Gemm_float/5120/4096/4096         1731 ms   1731 ms   1           TFlops=0.099244/s, operation=171.799G
Blocker/Gemm_float/5120/4096/4096       103 ms    103 ms    6           TFlops=1.66191/s, operation=1030.79G
Strider/Gemm_float/5120/4096/4096       19.9 ms   19.9 ms   30          TFlops=8.62941/s, operation=5.15396T
Aligner/Gemm_float/5120/4096/4096       17.3 ms   17.3 ms   33          TFlops=9.93519/s, operation=5.66936T
MultiLoader/Gemm_float/5120/4096/4096   19.8 ms   19.8 ms   31          TFlops=8.67294/s, operation=5.32576T
BcAvoider/Gemm_float/5120/4096/4096     24.2 ms   24.2 ms   26          TFlops=7.10627/s, operation=4.46677T
PpBuffer/Gemm_float/5120/4096/4096      20.9 ms   20.9 ms   28          TFlops=8.2018/s, operation=4.81036T
Dense/Gemm_float/5120/4096/4096         11.0 ms   11.0 ms   61          TFlops=15.5654/s, operation=10.4797T
Cublas/Gemm_float/5120/4096/4096        5.95 ms   5.95 ms   115         TFlops=28.8656/s, operation=19.7568T
Yzaiustc/Gemm_float/5120/4096/4096      7.23 ms   7.23 ms   93          TFlops=23.765/s, operation=15.9773T
Yhs/Gemm_float/5120/4096/4096           6.78 ms   6.78 ms   100         TFlops=25.3418/s, operation=17.1799T
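The counters in the table are consistent with the conventional GEMM operation count of 2 * m * n * k per call: for the 5120/4096/4096 problem that is about 171.799G operations, which matches the Naive row (1 iteration), and the other rows appear to scale with the iteration count. The sketch below reproduces that arithmetic. The helper names `gemm_flops` and `gemm_tflops` are hypothetical and not part of this repository, and the order in which the three sizes map to m, n, and k is an assumption (the product is the same either way).

```cpp
#include <cassert>
#include <cstdint>

// Conventional GEMM operation count: one multiply and one add for each
// (i, j, p) triple. The alpha/beta scaling is ignored, as is usual for
// this metric.
std::uint64_t gemm_flops(std::uint64_t m, std::uint64_t n, std::uint64_t k) {
    return 2 * m * n * k;
}

// Throughput in TFLOPS for a single GEMM call that took `milliseconds`.
double gemm_tflops(std::uint64_t m, std::uint64_t n, std::uint64_t k,
                   double milliseconds) {
    assert(milliseconds > 0.0);
    const double seconds = milliseconds * 1e-3;
    return static_cast<double>(gemm_flops(m, n, k)) / seconds / 1e12;
}
```

For example, `gemm_tflops(5120, 4096, 4096, 5.95)` gives roughly 28.9, in line with the Cublas row; the small difference comes from the rounded time shown in the table.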

Todo

  • Fix the bug that causes a segmentation fault in MatrixMulCUDA7.
  • Fix CUDA implementations 0 to 6, which cannot handle the case m = 8, n = 4096, k = 4096.
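The cause of the m = 8 failure is not stated here. One common source of this kind of problem in tiled kernels is a tile size that does not evenly divide a matrix dimension, so the following CPU sketch shows a blocked GEMM that clamps each tile to the matrix bounds. This is only an illustration of that general technique, not a diagnosis of the bug: `gemm_blocked`, the tile size `kTile`, and the row-major layout are all assumptions and do not come from this repository.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile size, chosen only for illustration.
constexpr std::size_t kTile = 32;

// Blocked GEMM: C = alpha * A * B + beta * C, row-major,
// A is m x k, B is k x n, C is m x n (assumed layout).
void gemm_blocked(std::size_t m, std::size_t n, std::size_t k,
                  float alpha, const std::vector<float>& A,
                  const std::vector<float>& B,
                  float beta, std::vector<float>& C) {
    assert(A.size() == m * k && B.size() == k * n && C.size() == m * n);
    // Apply beta once, then accumulate alpha * A * B tile by tile.
    for (float& c : C) c *= beta;
    for (std::size_t i0 = 0; i0 < m; i0 += kTile) {
        // Clamp the tile end so a partial tile (e.g. m = 8 < kTile)
        // never reads or writes past the matrix edge.
        const std::size_t i1 = std::min(i0 + kTile, m);
        for (std::size_t p0 = 0; p0 < k; p0 += kTile) {
            const std::size_t p1 = std::min(p0 + kTile, k);
            for (std::size_t j0 = 0; j0 < n; j0 += kTile) {
                const std::size_t j1 = std::min(j0 + kTile, n);
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t p = p0; p < p1; ++p) {
                        const float a = alpha * A[i * k + p];
                        for (std::size_t j = j0; j < j1; ++j) {
                            C[i * n + j] += a * B[p * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

On a GPU the equivalent guard is usually a bounds check on the row and column indices before each load and store, at some cost in branch overhead for the edge tiles.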