Improvement in ROCM fmha-backward #1082
Conversation
ensure ck_decoder does not dispatch in test_attn_bias_padded
Apply the existing linters (1/n)
add rocm_ci workflow
This reverts commit 12fb41c.
Fix lints
…ch the xformers scripts
For the Composable Kernel dependency, is it reliable to use the current CK trunk development branch?
xformers/csrc/attention/hip_fmha/attention_backward_generic_ck_tiled.cpp
 static void Run(BatchedBackwardParams& param, hipStream_t stream) {
   {
-    constexpr ck_tile::index_t kBlockSize = 256;
+    constexpr ck_tile::index_t kBlockSize = 64;
Curious about the reason for decreasing the block size?
Just to keep consistent with the ck_tile code, which is well tested with performance in mind.
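For illustration, a minimal sketch of why such a constexpr block size matters (hypothetical names: dummy_fmha_bwd_kernel, run_batched_backward; the placeholder math is not the actual xformers/ck_tile code): the same constant fixes both the compile-time tiling of the kernel and its launch configuration, so it has to stay aligned with the values the ck_tile kernels were tuned for.

// Hypothetical illustration only; names and the placeholder math are not xformers code.
#include <hip/hip_runtime.h>

template <int kBlockSize>
__global__ void dummy_fmha_bwd_kernel(const float* dout, float* dq, int n) {
  // blockDim.x equals kBlockSize, so the compile-time constant and the launch agree.
  int idx = blockIdx.x * kBlockSize + threadIdx.x;
  if (idx < n) {
    dq[idx] = dout[idx];  // stand-in for the real backward computation
  }
}

template <int kBlockSize>
void run_batched_backward(const float* dout, float* dq, int n, hipStream_t stream) {
  // The same constant sets the kernel's tiling (template parameter) and the
  // workgroup size of the launch; mismatching them would break the kernel's
  // assumptions, which is why the value follows the tuned ck_tile kernels.
  dim3 block(kBlockSize);
  dim3 grid((n + kBlockSize - 1) / kBlockSize);
  hipLaunchKernelGGL((dummy_fmha_bwd_kernel<kBlockSize>), grid, block, 0, stream,
                     dout, dq, n);
}

Under that reading, the change from 256 to 64 is a tuning choice inherited from ck_tile rather than a functional change.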
@@ -9,6 +9,46 @@
 #include <ck_tile/core.hpp>
 #include <stdexcept>
 
+#ifndef FMHA_SUPPORT_MAX_HEADDIM_128
Thanks for adding this! Checking: if we want to support only head dim = 128 (to save compile time), not 64, 32, or 256, is there an easy way to configure this?
Use unset MAX_JOBS; the compiling is very fast. Even though it would be easy to build only for dim == 128, we would rather not do this, since we would not be very confident in such a build: the verification scripts provided under tests/ are not specifically prepared for dim 128. For any change in the code, we always try to run the following scripts to verify that everything is running correctly:
#> pytest tests/test_mem_eff_attention.py::test_forward
#> pytest tests/test_mem_eff_attention.py::test_backward
#> pytest tests/test_mem_eff_attention.py::test_dropout
#> pytest tests/test_mem_eff_attention.py::test_dropout_backward_ck
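Returning to the head-dim question above, here is a minimal, hypothetical sketch of how a macro such as FMHA_SUPPORT_MAX_HEADDIM_128 could gate which head-dim specializations get compiled; run_fmha_bwd_headdim and dispatch_fmha_bwd are made-up names, not the xformers generator.

// Hypothetical sketch of gating head-dim instantiations behind a preprocessor macro.
#include <cstdio>
#include <stdexcept>

// One specialization per supported head dim; in a real generator each would be a
// separately instantiated kernel to keep compile time manageable.
template <int kMaxHeadDim>
void run_fmha_bwd_headdim() {
  std::printf("dispatching FMHA backward for head_dim <= %d\n", kMaxHeadDim);
}

void dispatch_fmha_bwd(int head_dim) {
  if (head_dim <= 32) {
    run_fmha_bwd_headdim<32>();
  } else if (head_dim <= 64) {
    run_fmha_bwd_headdim<64>();
  } else if (head_dim <= 128) {
    run_fmha_bwd_headdim<128>();
#ifndef FMHA_SUPPORT_MAX_HEADDIM_128
  // This branch (and its instantiation) is compiled only when head dims above
  // 128 are enabled, i.e. when FMHA_SUPPORT_MAX_HEADDIM_128 is left undefined.
  } else if (head_dim <= 256) {
    run_fmha_bwd_headdim<256>();
#endif
  } else {
    throw std::runtime_error("unsupported head_dim for FMHA backward");
  }
}

In this sketch, defining FMHA_SUPPORT_MAX_HEADDIM_128 drops the head-dim-256 branch and its instantiation at compile time; whether the real generator works this way is an assumption here.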
tests/test_mem_eff_attention.py
@@ -1003,6 +998,38 @@ def test_dropout_backward_ck(dt, q_len, kv_len, batch_size, k, p):
     )
 
 
+@cuda_only
I think this test ended up here through a merge conflict resolution gone bad?
LGTM on the xFormers side (didn't review the generated files/generator but I trust you on that)
This PR mostly provides updates to the ROCm FMHA backward. Specifically:
To test/verify, use the pytest scripts listed earlier in this conversation (test_forward, test_backward, test_dropout, and test_dropout_backward_ck in tests/test_mem_eff_attention.py).
To benchmark performance, use:
#> python xformers/benchmark/benchmark_mem_eff_attention.py --omit-forward