Large performance regression for FP8 E4M3 GEMM with triton==2.3 #3828

Open
mgoin opened this issue May 3, 2024 · 5 comments

Comments


mgoin commented May 3, 2024

There is a very large performance regression (6x slower for [8192,8192]x[8192,8192]) when using Triton for matmuls with float8 e4m3 inputs, comparing 2.2.0 and 2.3.0.

We use Triton for our fused MoE implementation in vLLM and noticed this regression while upgrading PyTorch from 2.2.1 to 2.3.0 (thanks to @pcmoritz for quickly detecting it), which brought an upgrade of Triton as well (2.2.0 -> 2.3.0).

This regression seems to go away if I use the latest nightly, but that still leaves us choosing between very poor FP8 performance with Triton and staying on the latest stable PyTorch (which we would like to have for FP8 GEMM support on SM89). Is it possible this could be hotfixed?

Below I share my minimal reproduction using triton.ops.matmul on an H100:

Results:

> pip install triton==2.2 numpy torch
> python benchmark_fp8.py
Benchmarking [torch.Size([8192, 8192]), fp8e4nv] x [torch.Size([8192, 8192]), fp8e4nv]
Elapsed time for 100 iterations: 0.086390 seconds

> pip install triton==2.3
> python benchmark_fp8.py
Benchmarking [torch.Size([8192, 8192]), fp8e4nv] x [torch.Size([8192, 8192]), fp8e4nv]
Elapsed time for 100 iterations: 0.547999 seconds

> pip uninstall -y triton
> pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly
> python benchmark_fp8.py
Benchmarking [torch.Size([8192, 8192]), fp8e4nv] x [torch.Size([8192, 8192]), fp8e4nv]
Elapsed time for 100 iterations: 0.088446 seconds
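
For scale, converting those timings to throughput (my arithmetic, assuming the elapsed time covers only the 100 matmuls):

# Each 8192x8192x8192 GEMM is 2 * 8192^3 ~= 1.1 TFLOP.
flop_per_matmul = 2 * 8192**3
iters = 100
for label, seconds in [("triton==2.2", 0.086390), ("triton==2.3", 0.547999)]:
    tflops = flop_per_matmul * iters / seconds / 1e12
    print(f"{label}: {tflops:.0f} TFLOPS")  # ~1273 vs ~201, i.e. the ~6x regression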

Benchmarking script:

import triton
import triton.ops
import triton.language as tl
import torch
import time

benchmark_iters = 100

# Create input matrices
A = torch.randn(8192, 8192, dtype=torch.float16, device='cuda')
B = torch.randn(8192, 8192, dtype=torch.float16, device='cuda')

# Quantize
A_fp8 = A.to(torch.float8_e4m3fn)
B_fp8 = B.to(torch.float8_e4m3fn).T

# Convert to triton float8 dtype
A_fp8 = triton.reinterpret(A_fp8, tl.float8e4nv)
B_fp8 = triton.reinterpret(B_fp8, tl.float8e4nv)

print(f"Benchmarking [{A_fp8.shape}, {A_fp8.dtype}] x [{B_fp8.shape}, {B_fp8.dtype}]")

# Warm up GPU
for _ in range(10):
    c = triton.ops.matmul(A_fp8, B_fp8)
torch.cuda.synchronize()

# Timing the matmul
start_time = time.time()
for _ in range(benchmark_iters):
    c = triton.ops.matmul(A_fp8, B_fp8)
torch.cuda.synchronize()
elapsed_time = time.time() - start_time

print(f"Elapsed time for {benchmark_iters} iterations: {elapsed_time:.6f} seconds")

@mgoin changed the title from "Large performance regression for FP8 GEMM with triton==2.3" to "Large performance regression for FP8 E4M3 GEMM with triton==2.3" on May 3, 2024

atalman commented May 6, 2024

cc @jansel @malfet @seemethere
This looks like an H100-specific issue. On an A100 I am getting this error instead:

  File "/home/atalman/miniconda3/envs/py310/lib/python3.10/site-packages/triton/runtime/jit.py", line 416, in run
    self.cache[device][key] = compile(
  File "/home/atalman/miniconda3/envs/py310/lib/python3.10/site-packages/triton/compiler/compiler.py", line 191, in compile
    module = src.make_ir(options)
  File "/home/atalman/miniconda3/envs/py310/lib/python3.10/site-packages/triton/compiler/compiler.py", line 117, in make_ir
    return ast_to_ttir(self.fn, self, options=options)
  File "/home/atalman/miniconda3/envs/py310/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1231, in ast_to_ttir
    raise CompilationError(fn.src, node, repr(e)) from e
triton.compiler.errors.CompilationError: at 45:31:            a = tl.load(A)
            b = tl.load(B)
        else:
            k_remaining = K - k * (BLOCK_K * SPLIT_K)
            _0 = tl.zeros((1, 1), dtype=C.dtype.element_ty)
            a = tl.load(A, mask=rk[None, :] < k_remaining, other=_0)
            b = tl.load(B, mask=rk[:, None] < k_remaining, other=_0)
        if AB_DTYPE:
            a = a.to(C.dtype.element_ty)
            b = b.to(C.dtype.element_ty)
        if fp8_fast_accum:
            acc = tl.dot(a, b, acc, out_dtype=dot_out_dtype, allow_tf32=allow_tf32)
                               ^
AssertionError('Dot op does not support fp8e4nv on CUDA arch < 90')


mgoin commented May 6, 2024

Hey @atalman, both Triton and PyTorch only support FP8 GEMMs on GPUs with hardware support for FP8 tensor cores, so this is intended to work only on Hopper (H100) or Ada Lovelace (L4, L40, RTX 4000 series).
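
A minimal way to check this up front (a sketch, not code from vLLM or Triton):

import torch

# FP8 tensor-core dot needs SM 8.9 (Ada Lovelace) or SM 9.0 (Hopper);
# A100 is SM 8.0, which is why the AssertionError above fires there.
major, minor = torch.cuda.get_device_capability()
supports_fp8_gemm = (major, minor) >= (8, 9)
print(f"sm_{major}{minor}: FP8 GEMM supported = {supports_fp8_gemm}")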


plotfi commented May 7, 2024

It seems the change to maxNumImpreciseAcc from #2804 brings the run time for matmuls back to 2.2.x levels.

@ThomasRaoux (Collaborator) replied:

> It seems the change to maxNumImpreciseAcc from #2804 brings the run time for matmuls back to 2.2.x levels.

Ah right, this is because before that change the accumulation was happening at a lower precision. To solve that you need to use the three-source dot (acc = tl.dot(a, b, acc) instead of acc += tl.dot(a, b)), because the latter representation suggests the user wants a 32-bit addition.
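
To make the two accumulation styles concrete, here is a minimal hypothetical Triton kernel (fp8_gemm_kernel and fp8_gemm are names I made up; this is not the triton.ops.matmul kernel, and it assumes FP8-capable hardware, contiguous inputs, and M, N, K divisible by the block sizes):

import torch
import triton
import triton.language as tl

@triton.jit
def fp8_gemm_kernel(A, B, C, M, N, K,
                    stride_am, stride_ak, stride_bk, stride_bn,
                    stride_cm, stride_cn,
                    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    a_ptrs = A + rm[:, None] * stride_am + rk[None, :] * stride_ak
    b_ptrs = B + rk[:, None] * stride_bk + rn[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        # Three-source dot: the accumulator is threaded through the MMA, so the
        # compiler can keep the fast (lower-precision) FP8 accumulation path.
        acc = tl.dot(a, b, acc)
        # The two-source form plus a separate add requests a full-precision fp32
        # addition after every dot, which is the slow pattern on triton==2.3.0:
        # acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = C + rm[:, None] * stride_cm + rn[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(tl.float16))

def fp8_gemm(a, b):
    # a: (M, K) and b: (K, N), both torch.float8_e4m3fn and contiguous.
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), dtype=torch.float16, device=a.device)
    grid = (M // 128, N // 128)
    fp8_gemm_kernel[grid](a, b, c, M, N, K,
                          a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                          c.stride(0), c.stride(1),
                          BLOCK_M=128, BLOCK_N=128, BLOCK_K=64)
    return c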


pcmoritz commented Jun 6, 2024

Fixed in triton 2.3.1 now :)
