
extra/gemm/max_matmul: start of custom kernels for GEMM #6926

Draft · wants to merge 9 commits into master from max_matmul
Conversation

@flammit (Contributor) commented Oct 7, 2024

As a prerequisite to implementing full-speed GEMM for NV, here are handwritten versions of GEMM that show the incremental progress needed to get there and the associated speed improvements.

| Acc | Variation | Performance |
| --- | --- | --- |
| FP32 | hcopt | 1301.50 us, 105600.10 GFLOPS matmul, 77.34 GB/s |
| FP32 | flat_smem_input | 1309.41 us, 104962.67 GFLOPS matmul, 76.88 GB/s |
| FP32 | swizzled_smem_input | 882.69 us, 155705.02 GFLOPS matmul, 114.04 GB/s |
| FP32 | 2_stage_swizzled_smem_input | 831.49 us, 165292.77 GFLOPS matmul, 121.06 GB/s |
| FP32 | max | 826.37 us, 166316.89 GFLOPS matmul, 121.81 GB/s |
| FP16 | 3_stage_swizzled | 505.66 us, 271798.97 GFLOPS matmul, 199.07 GB/s |
| FP16 | max | 404.48 us, 339791.71 GFLOPS matmul, 248.87 GB/s |
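The big FP32 jump from `flat_smem_input` to `swizzled_smem_input` comes from swizzling the shared-memory layout so that accesses spread across banks instead of conflicting. As an illustration only (this is a generic XOR-swizzle sketch, not the PR's actual layout; `swizzle` and `cols_per_row` are hypothetical names):

```python
# Generic XOR-based SMEM swizzle sketch (illustrative, not the PR's code).
# Flat layout: element (row, col) lives at row * cols_per_row + col, so a
# column walk hits the same bank every row. XOR-ing the column with low
# bits of the row rotates each row's mapping, spreading a column walk
# across distinct banks while keeping each row a permutation of itself.
def swizzle(row, col, cols_per_row=8):
    return row * cols_per_row + (col ^ (row % cols_per_row))
```

With `cols_per_row=8`, each row still occupies its own contiguous slice (the XOR is a bijection within the row), but walking down column 0 now touches 8 different bank offsets instead of one.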

The command for hcopt is:

```sh
PYTHONPATH=. CUDA=1 GEMM_VARIATION="hcopt" DTYPE_IN=half DTYPE_OUT=half DTYPE_ACC=float CNT=1024 INPUT=RAND python3 ./extra/gemm/max_matmul.py
```

The command for the rest of the FP32-acc variations is:

```sh
PYTHONPATH=. CUDA=1 GEMM_VARIATION="$VARIATION" DTYPE_IN=half DTYPE_OUT=float DTYPE_ACC=float CNT=1024 INPUT=RAND python3 ./extra/gemm/max_matmul.py
```

The command for the rest of the FP16-acc variations is:

```sh
PYTHONPATH=. CUDA=1 GEMM_VARIATION="$VARIATION" DTYPE_IN=half DTYPE_OUT=half DTYPE_ACC=half CNT=1024 INPUT=ONES python3 ./extra/gemm/max_matmul.py
```
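For reference, the GFLOPS and GB/s figures in the table are consistent with a 4096×4096×4096 matmul with half-precision inputs and outputs. A quick sketch of the arithmetic (my reconstruction; the problem size is inferred from the numbers, not stated in the PR):

```python
# Reconstructing the reported throughput from kernel time, assuming an
# N=4096 square matmul with fp16 A, B, and C (an assumption inferred
# from the numbers above, not stated explicitly in the PR).
N = 4096
BYTES_PER_ELEM = 2  # fp16

def throughput(time_us):
    flops = 2 * N**3                           # one multiply-add = 2 FLOPs
    bytes_moved = 3 * N * N * BYTES_PER_ELEM   # read A and B, write C
    seconds = time_us * 1e-6
    return flops / seconds / 1e9, bytes_moved / seconds / 1e9  # GFLOPS, GB/s
```

For example, `throughput(1301.50)` gives roughly `(105600, 77.3)`, matching the hcopt row.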

@flammit (Contributor, Author) commented Oct 7, 2024

Added an unoptimized FP16-input / FP16-acc MMA example.

@geohot (Collaborator) commented Oct 8, 2024

So I tested this with Triton once; what does Triton get? IIRC it was well over 200. We should at least match that with the handcoded kernels.

@flammit flammit marked this pull request as draft October 8, 2024 18:08
@flammit flammit force-pushed the max_matmul branch 2 times, most recently from 52feba8 to 7234150 Compare October 9, 2024 23:14
(bot) This branch is currently behind tinygrad/master. The line count difference bot is disabled.

@flammit (Contributor, Author) commented Oct 11, 2024

Added a 3-stage pipeline with swizzled SMEM inputs for FP16 acc that does 270 TFLOPS (still less than the 330 TFLOPS in CUTLASS, but not as bad as before).

Note: this PR depends on #6956 being landed first.
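For context on what an "N-stage pipeline" means here: the MMA consumes tiles out of one SMEM staging buffer while loads for future K-tiles are already in flight into the others. The control flow can be sketched in Python (my illustration, not the PR's CUDA code; `load` and `compute` are hypothetical stand-ins for the async global-to-SMEM copy and the tensor-core MMA):

```python
# Sketch of an N-stage software pipeline over the GEMM K-loop
# (illustrative only; the real kernel does this with async copies
# and barriers in CUDA, not sequential Python calls).
STAGES = 3  # 3-stage pipeline, as in the PR comment

def pipelined_k_loop(num_k_tiles, load, compute):
    bufs = [None] * STAGES  # circular set of SMEM staging buffers
    # Prologue: issue loads for the first STAGES-1 tiles up front.
    for k in range(min(STAGES - 1, num_k_tiles)):
        bufs[k % STAGES] = load(k)
    # Steady state: prefetch tile k+STAGES-1, then consume tile k.
    for k in range(num_k_tiles):
        nxt = k + STAGES - 1
        if nxt < num_k_tiles:
            bufs[nxt % STAGES] = load(nxt)
        compute(bufs[k % STAGES])
```

The point of the extra stages is latency hiding: with 3 buffers, two loads can be in flight while one tile is being consumed, so the MMA units rarely stall waiting on memory.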

@flammit flammit force-pushed the max_matmul branch 2 times, most recently from 5b44292 to 9465409 Compare October 17, 2024 01:22
@flammit flammit marked this pull request as ready for review October 17, 2024 01:27
@flammit (Contributor, Author) commented Oct 17, 2024

Happy to remove the extra stages/variations and keep just the "max" variations, but I figure they might be useful for comparison as future kernel-rendering features are incrementally added.
