Introduction

This repository showcases various features of GEMM aimed at enhancing its performance.

C = alpha * A * B + beta * C

Matrix Multiplication Algorithm Implementations

MatrixMulCUDA0
- Naive GEMM implementation.
MatrixMulCUDA1
- Utilizing warp/block for fused multiply-add (FMA) calculations.
MatrixMulCUDA2
- Loading data with strides from global memory to shared memory.
MatrixMulCUDA3
- Aligning shared memory for optimized memory access.
MatrixMulCUDA4
- Loading data twice per thread for improved data reuse.
MatrixMulCUDA5
- Minimizing bank conflicts in shared memory accesses.
MatrixMulCUDA6
- Using ping-pong buffer strategy.
MatrixMulCUDA7
- Implementing fast 128x128 block GEMM. (Note: A bug causing segment faults needs to be fixed.)
MatrixMulCUDA8
- desen refer to https://github.com/Cjkkkk/CUDA_gemm/blob/master/src/cuda/dense.cu
MatrixMulCUDA9
- Implementation using cuBLAS.
MatrixMulCUDA10
- yzaiust refer to https://github.com/yzhaiustc/Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
MatrixMulCUDA11
- yinghan resfer to https://github.com/Yinghan-Li/YHs_Sample/blob/master/cuda/gemm/sgemm.cu

Edit build.sh file
- cmake -DCUDA_ARCH=/your/cuda/arch -DCUDA_TOOLKIT_ROOT_DIR=/local/cuda/path
bash build.sh

Run on RTX 4070 Ti | Theoretical Performance: FP32 (float) 40.09 TFLOPS

Benchmark	Time	CPU	Iterations	UserCounters
Naive/Gemm_float/5120/4096/4096	1731 ms	1731 ms	1	TFlops=0.099244/s, operation=171.799G
Blocker/Gemm_float/5120/4096/4096	103 ms	103 ms	6	TFlops=1.66191/s, operation=1030.79G
Strider/Gemm_float/5120/4096/4096	19.9 ms	19.9 ms	30	TFlops=8.62941/s, operation=5.15396T
Aligner/Gemm_float/5120/4096/4096	17.3 ms	17.3 ms	33	TFlops=9.93519/s, operation=5.66936T
MultiLoader/Gemm_float/5120/4096/4096	19.8 ms	19.8 ms	31	TFlops=8.67294/s, operation=5.32576T
BcAvoider/Gemm_float/5120/4096/4096	24.2 ms	24.2 ms	26	TFlops=7.10627/s, operation=4.46677T
PpBuffer/Gemm_float/5120/4096/4096	20.9 ms	20.9 ms	28	TFlops=8.2018/s, operation=4.81036T
Dense/Gemm_float/5120/4096/4096	11.0 ms	11.0 ms	61	TFlops=15.5654/s, operation=10.4797T
Cublas/Gemm_float/5120/4096/4096	5.95 ms	5.95 ms	115	TFlops=28.8656/s, operation=19.7568T
Yzaiustc/Gemm_float/5120/4096/4096	7.23 ms	7.23 ms	93	TFlops=23.765/s, operation=15.9773T
Yhs/Gemm_float/5120/4096/4096	6.78 ms	6.78 ms	100	TFlops=25.3418/s, operation=17.1799T

Address the bug causing a segment fault in MatrixMulCUDA7.
Fix the issue where CUDA implementations 0 to 6 cannot handle cases where m = 8 n = 4096 k = 4096.