Fast Multi-head Attention support on AMD ROCm #978

Merged
540 commits merged into facebookresearch:main on Mar 4, 2024

Conversation

qianfengz (Contributor)

This PR adds three flash-attention implementations for AMD ROCm:

  1. A generic FMHA forward based on composable_kernel components, accelerated on AMD MI2xx/MI3xx GPUs
  2. A decoder FMHA forward implemented directly as a HIP kernel
  3. A Triton-based FMHA forward operator

In more detail, the following code is added in this PR (a minimal usage sketch follows the list below):

  1. xFormers operator and its C++ implementation for the generic FMHA forward, as well as the underlying composable_kernel_tiled submodule

    xformers/ops/fmha/ck.py
    xformers/csrc/attention/hip_fmha/
    third_party/composable_kernel_tiled/

  2. xFormers operators and their C++ implementation for the decoder FMHA forward

    xformers/ops/fmha/ck_decoder.py, ck_splitk.py
    xformers/csrc/attention/hip_fmha/

  3. xFormers operator for the Triton FMHA forward

    xformers/ops/fmha/triton.py
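As a quick orientation (not part of the PR itself), a minimal forward-only usage sketch for the new CK operator could look like the following; it assumes a ROCm build of PyTorch and xFormers with this PR applied, and goes through the public memory_efficient_attention_forward entry point:

    import torch
    import xformers.ops as xops
    from xformers.ops import fmha

    # BMHK inputs: batch=2, seqlen=1024, heads=8, head_dim=64, fp16 on the ROCm GPU
    q, k, v = (
        torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
        for _ in range(3)
    )

    # forward-only dispatch through the generic CK operator added by this PR
    out = xops.memory_efficient_attention_forward(q, k, v, op=fmha.ck.FwOp)
    print(out.shape)  # torch.Size([2, 1024, 8, 64])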

The following commands are used to verify the implementation:

#> pytest tests/test_mem_eff_attention.py::test_forward
#> pytest tests/test_mem_eff_attention.py::test_mqa_forward
#> pytest tests/test_mem_eff_attention.py::test_decoder
#> pytest tests/test_mem_eff_attention.py::test_splitk_decoder
#> pytest tests/test_mem_eff_attention.py::test_splitk_reference
#> pytest tests/test_mem_eff_attention.py::test_triton_splitk_decoder

The following scripts are used to benchmark the performance of the implementation:

#> python xformers/benchmarks/benchmark_mem_eff_attention.py
#> python xformers/benchmarks/benchmark_mem_eff_attention_mqa.py
#> python xformers/benchmarks/benchmark_mem_eff_attn_decoder.py
#> python xformers/benchmarks/benchmark_attn_decoding.py

tenpercent and others added 30 commits (December 7, 2023)
@tenpercent (Contributor)

> should I just do a pastebin? there are quite a few errors like part 1. Edit: side note: my default GFX isn't 1030, but it has to be overridden (exported) to that for my GPU to work on ROCm; before, on the original GFX, there were 2 persisting errors on part 2.

I think the related part is

/home/hina/xformers/third_party/composable_kernel_tiled/include/ck/tile_program/warp_tile/warp_gemm_attribute_mfma_impl_hip.hpp:116:17: error: '__builtin_amdgcn_mfma_f32_32x32x8bf16_1k' needs target feature mai-insts
            c_vec = __builtin_amdgcn_mfma_f32_32x32x8bf16_1k(a_vec, b_vec, c_vec, 0, 0, 0);
                    ^
    1 error generated when compiling for gfx1030.

But yeah, a pastebin works better for such walls of text.
And it looks like we are using a compiler intrinsic, __builtin_amdgcn_mfma_f32_32x32x8bf16_1k, which cannot be compiled for gfx1030, as this arch doesn't support it.
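For reference, a rough runtime check for MFMA-capable hardware could look like the sketch below; the gcnArchName attribute on ROCm builds of PyTorch and the exact architecture list are assumptions here, not something this PR provides:

    import torch

    # MFMA matrix instructions (used by the CK kernels here) exist on CDNA GPUs
    # such as MI100/MI200/MI300; RDNA parts like gfx1030 do not have them.
    MFMA_ARCHS = ("gfx908", "gfx90a", "gfx940", "gfx941", "gfx942")

    def has_mfma(device_index: int = 0) -> bool:
        if not (torch.cuda.is_available() and torch.version.hip):
            return False
        # gcnArchName looks like "gfx90a:sramecc+:xnack-" on ROCm builds of PyTorch
        arch = torch.cuda.get_device_properties(device_index).gcnArchName
        return arch.split(":")[0] in MFMA_ARCHS

    print(has_mfma())  # False on a gfx1030 system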

@HinaHyugaHime

meaning I wouldn't be able to use the end result?

@tenpercent (Contributor) commented Feb 26, 2024

Yes, meaning xformers is not yet supported on your device. Depending on your application, there may be some other solution for using GPU acceleration on your device.

@HinaHyugaHime

Do you have any recommendations to try?

@tenpercent (Contributor)

Now the win8-build CI job fails with:

C:/Users/runneradmin/AppData/Local/Temp/pip-req-build-o3otijw7/third_party/flash-attention/csrc/cutlass/include\cute/int_tuple.hpp(85): error C2665: 'cute::get': no overloaded function could convert all the argument types

@tenpercent (Contributor)

Is there any specific model you're trying to run?

@HinaHyugaHime

PyTorch SD (Stable Diffusion)

@tenpercent (Contributor)

You could try your luck with https://github.com/ROCm/AITemplate/tree/navi3_rel_ver_1.1/examples/05_stable_diffusion

@@ -180,11 +180,13 @@ def validate_inputs(self) -> None:
and self.value.shape == (B, Mkv, Kv)
)
H = self.query.shape[-2]
Hkv = self.key.shape[-2]
Contributor

The current interface of MHA is that H always has to match between q, k and v. If you want to do GQA - e.g. one kv-head for every n q-heads - you have to send 5D inputs. (Thus we're forcing the user to be very explicit.) Do we really want to relax that rule in this PR?
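For context, the existing 5D (BMGHK) convention referred to here can be sketched roughly as follows; the shapes and the expand trick reflect my understanding of the documented GQA/MQA usage, not code from this PR, and assume a GPU device is available:

    import torch
    import xformers.ops as xops

    B, M, K = 2, 1024, 64
    G, H = 2, 4  # G kv-head groups, H query heads per group (8 query heads total)

    q = torch.randn(B, M, G, H, K, device="cuda", dtype=torch.float16)
    # one kv head per group, broadcast across the H query heads of that group
    k = torch.randn(B, M, G, 1, K, device="cuda", dtype=torch.float16).expand(B, M, G, H, K)
    v = torch.randn(B, M, G, 1, K, device="cuda", dtype=torch.float16).expand(B, M, G, H, K)

    out = xops.memory_efficient_attention(q, k, v)  # output shape [B, M, G, H, K]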

xformers/ops/fmha/triton.py (resolved discussion)
@@ -22,13 +22,25 @@ Available implementations
:member-order: bysource
Contributor

(Looks like decoder and triton_splitk should have been added here months ago. 🫢)

Comment on lines +29 to +34
rocm_only = pytest.mark.skipif(
not torch.cuda.is_available() or not torch.version.hip, reason="requires ROCM"
)
disable_on_rocm = pytest.mark.skipif(
not not torch.version.hip, reason="could not be done on ROCM"
)
@bottler (Contributor) commented Feb 28, 2024

Can you help me understand these two decorators? Is not not torch.version.hip equivalent to torch.version.hip is not None? And what does the rocm_only condition really mean?

Also perhaps those new tests which only apply to AMD could be in a new file?
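One possible more explicit spelling of these two markers, assuming torch.version.hip is either None (CUDA builds) or a non-empty version string (ROCm builds), would be:

    import pytest
    import torch

    # torch.version.hip is None on CUDA builds and a version string such as
    # "5.7.31921" on ROCm builds, so truthiness and `is not None` agree here.
    rocm_only = pytest.mark.skipif(
        not (torch.cuda.is_available() and torch.version.hip),
        reason="requires a ROCm build with a visible GPU",
    )
    disable_on_rocm = pytest.mark.skipif(
        torch.version.hip is not None,
        reason="not supported on ROCm",
    )

This keeps the behaviour of the originals while making the intent (ROCm build present vs. absent) explicit.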

"attn_bias_type", [type(None), torch.Tensor, fmha.attn_bias.LowerTriangularMask]
)
@pytest.mark.parametrize("op", [fmha.ck.FwOp])
def test_mqa_forward(
Contributor

To follow up on @bottler's question about @rocm_only - it'd be better for fmha.ck.FwOp to be covered by the generic test_forward and test_mqa_decoding. Then we don't need a separate test function (and eventually won't need @rocm_only after all such cases are refactored)

Not sure if this should be blocking the merge or can be done as a follow-up

Contributor

I agree with that - let's try to factor as much code as possible :)

@danthe3rd (Contributor) left a comment

This is looking good! A few more comments on the PR - mostly to simplify the test file, but otherwise we could merge.
Let's not worry about the Windows build on CI, which is already broken on master (and not even building the ROCm stuff anyway).

@@ -310,6 +318,185 @@ def T(t):
return out.permute((0, 2, 1, 3))


def ref_attention_splitk_bmhk(
Contributor

I understand that this is useful for debugging the kernel, but then maybe include it as a standalone python file in the repo with the C++ files? It does not need to be part of the test file (which is already way too long! :) )
(same goes for the function below ref_attention_mqa)
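For readers unfamiliar with the split-k reference under discussion, here is a self-contained sketch of the idea (processing the key/value sequence in chunks and merging per-chunk partial results with a running log-sum-exp); this is an illustration only, not the PR's ref_attention_splitk:

    import torch

    def splitk_reference(q, k, v, split_k: int = 2):
        # q, k, v: [batch * heads, seq, head_dim]. Process k/v in split_k chunks
        # and merge the partial results with a running log-sum-exp, which is what
        # a split-k decoder kernel does across its workgroups.
        scale = q.shape[-1] ** -0.5
        acc = torch.zeros(q.shape, dtype=torch.float32, device=q.device)
        lse = torch.full(q.shape[:-1], float("-inf"), dtype=torch.float32, device=q.device)
        for k_i, v_i in zip(k.chunk(split_k, dim=1), v.chunk(split_k, dim=1)):
            s = torch.einsum("bmk,bnk->bmn", q.float() * scale, k_i.float())
            m_i = s.max(dim=-1).values
            p = torch.exp(s - m_i[..., None])
            lse_i = m_i + torch.log(p.sum(dim=-1))
            new_lse = torch.logaddexp(lse, lse_i)
            acc = acc * torch.exp(lse - new_lse)[..., None] + torch.einsum(
                "bmn,bnk->bmk", p, v_i.float()
            ) * torch.exp(m_i - new_lse)[..., None]
            lse = new_lse
        return acc.to(q.dtype)

    # sanity check against plain softmax attention
    q, k, v = (torch.randn(4, 128, 64) for _ in range(3))
    ref = torch.softmax(q @ k.transpose(-1, -2) * 64 ** -0.5, dim=-1) @ v
    assert torch.allclose(splitk_reference(q, k, v, split_k=4), ref, atol=1e-4)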

"attn_bias_type", [type(None), torch.Tensor, fmha.attn_bias.LowerTriangularMask]
)
@pytest.mark.parametrize("op", [fmha.ck.FwOp])
def test_mqa_forward(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with that - let's try to factor as much code as possible :)

Comment on lines +1484 to +1485
if op is fmha.triton.FwOp:
pytest.skip("Triton Flash Attention 2 doesn't support backward pass yet")
Contributor

Oh I guess it's not supported on AMD because we don't have any backward pass. Fine for me to exclude it for NVIDIA as well.

Comment on lines 1562 to 1565

if skip_reasons := op.not_supported_reasons(fmha.Inputs(q, q, q)):
pytest.skip("; ".join(skip_reasons))

Contributor

This defeats the point of this test. None of the operators support these sorts of strides.
What was the failure?

Contributor

Thanks for catching this! It got here as a refactor of skipping the test when the op uses Triton and the Python version is 3.8 or older, and I missed the context when refactoring. I think we can drop this check and the next one.

Comment on lines 1581 to 1584

if skip_reasons := op.not_supported_reasons(fmha.Inputs(q, q, q)):
pytest.skip("; ".join(skip_reasons))

Contributor

Same here

@pytest.mark.parametrize("padding, bsz", [(32, 8), (4096, 1)])
@pytest.mark.parametrize("split_k", [1, 2, 4])
@pytest.mark.parametrize("device", ["cpu"])
def test_splitk_reference(
Contributor

this test should not be needed if we always use ref_attention instead of ref_attention_splitk

tests/test_mem_eff_attention.py (resolved discussion)
@danthe3rd (Contributor) left a comment

This LGTM to merge in xFormers! Thanks a lot for the huge effort in making the entire codebase compatible with AMD GPUs - that was not an easy thing :o
Probably still a few things to improve, but we leave that to the future :)

> Triton does not support if expressions (ternary operators) with dynamic conditions, use if statements instead
OPS = [
(xformers.ops.fmha.cutlass.FwOp, xformers.ops.fmha.cutlass.BwOp),
(xformers.ops.fmha.flash.FwOp, xformers.ops.fmha.flash.BwOp),
# TODO: Triton is not stable: it can trigger Illegal Memory Accesses
Member

Keep this comment?



def _minimum_gemm_alignment(inp: Inputs) -> int:
return 1
Member

For CUDA/NVIDIA GPUs, we have GEMM alignment logic like https://github.com/facebookresearch/xformers/blob/main/xformers/ops/fmha/cutlass.py#L57-L60 . I wonder why it is set to 1 here?

@danthe3rd merged commit 44b0d07 into facebookresearch:main on Mar 4, 2024
27 of 32 checks passed
@HinaHyugaHime

dan, I don't think this pull was ready yet, but ok

@danthe3rd (Contributor)

@HinaHyugaHime this is working well enough for internal users, and we wanted to get it merged ASAP: given the size of the change, it would have been a nightmare to keep rebasing/merging new changes.
xFormers still mainly supports NVIDIA, and AMD support is experimental / "best-effort". Feel free to open new issues and tag the people from this PR, though.

Labels: CLA Signed, module: rocm
10 participants