Large performance regression for FP8 E4M3 GEMM with triton==2.3 (#3828)
cc @jansel @malfet @seemethere
Hey @atalman, both Triton and torch only support FP8 GEMM on GPUs with hardware support for FP8 tensor cores. So this is intended to work only on Hopper (H100) or Ada Lovelace (L4, L40, RTX 4000 series).
It seems the change to `maxNumImpreciseAcc` from #2804 brings the run time for matmuls back to 2.2.x levels.
Ah, right. This is because before that change the accumulation was happening at a lower precision. To solve that, you need to use the 3-source dot (i.e., pass the accumulator into the dot itself).
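For intuition, here is a plain NumPy illustration of the precision effect described above; it is a hypothetical sketch of why accumulator precision matters, not the Triton kernel or the actual fix. Accumulating the same float32 products into a reduced-precision running sum drifts away from the true dot product, which is why the dot should carry a float32 accumulator across K-blocks:

```python
# NumPy sketch (illustrative only): compare a float32 accumulator with a
# float16 accumulator over identical float32 products.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)
y = rng.standard_normal(4096).astype(np.float32)

# High-precision reference for the dot product.
ref = float(np.dot(x.astype(np.float64), y.astype(np.float64)))

acc32 = np.float32(0.0)   # float32 accumulator (what the fix restores)
acc16 = np.float16(0.0)   # float16 accumulator (mimics imprecise accumulation)
for a, b in zip(x, y):
    p = a * b                      # product computed in float32 in both cases
    acc32 = np.float32(acc32 + p)
    acc16 = np.float16(acc16 + p)  # rounds the running sum to fp16 every step

err32 = abs(float(acc32) - ref)
err16 = abs(float(acc16) - ref)
print(f"fp32-accumulator error: {err32:.2e}, fp16-accumulator error: {err16:.2e}")
```

The error of the low-precision accumulator grows with the reduction length, while the float32 accumulator stays close to the reference, mirroring the accuracy/speed trade-off that `maxNumImpreciseAcc` controls.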
Fixed in triton 2.3.1 now :)
There is a very large performance regression (6x slower for [8192,8192]x[8192,8192]) when using Triton for matmuls with float8 e4m3 inputs, comparing `2.2.0` and `2.3.0`. We use Triton for our fused MoE implementation in vLLM and noticed this regression while upgrading PyTorch (thanks for quickly detecting it, @pcmoritz) from `2.2.1` -> `2.3.0`, which brought an upgrade for Triton as well (`2.2.0` -> `2.3.0`).

This regression seems to go away if I use the latest nightly, but we are still stuck choosing between very poor FP8 performance with Triton and using the latest stable PyTorch (which we would like to have for FP8 GEMM support on SM89). Is it possible this could be hotfixed?
Below I share my minimal reproduction using `triton.ops.matmul` on an H100.

Results:
Benchmarking script:
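The script itself was not preserved in this capture. As a stand-in, here is a hypothetical sketch of the kind of harness such a report typically uses; the helper names (`bench`, `matmul_tflops`) and the commented-out Triton call are my own, not from the original issue:

```python
# Hypothetical benchmark harness sketch; the original script was not
# preserved, so everything here is illustrative.
import time

def matmul_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """TFLOP/s achieved by an (m x k) @ (k x n) matmul that took `seconds`."""
    return 2.0 * m * n * k / seconds / 1e12

def bench(fn, warmup: int = 10, iters: int = 100) -> float:
    """Mean wall-clock seconds per call of fn(), after warmup."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

# On an H100 one would time the Triton FP8 matmul roughly like this
# (requires CUDA; synchronize before reading the clock for honest timings):
#   import torch, triton.ops
#   a = torch.randn(8192, 8192, device="cuda").to(torch.float8_e4m3fn)
#   b = torch.randn(8192, 8192, device="cuda").to(torch.float8_e4m3fn)
#   secs = bench(lambda: (triton.ops.matmul(a, b), torch.cuda.synchronize()))
#   print(f"{matmul_tflops(8192, 8192, 8192, secs):.1f} TFLOP/s")
```

Comparing the reported TFLOP/s between triton 2.2.0 and 2.3.0 with such a harness is what surfaces the ~6x gap described above.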