Improvement in ROCM fmha-backward #1082
Conversation
ensure ck_decoder does not dispatch in test_attn_bias_padded
Apply the existing linters (1/n)
add rocm_ci workflow
This reverts commit 12fb41c.
Fix lints
…ch the xformers scripts
For the Composable Kernel dependency, is it reliable to use the current CK trunk development branch?
xformers/csrc/attention/hip_fmha/attention_backward_generic_ck_tiled.cpp
 static void Run(BatchedBackwardParams& param, hipStream_t stream) {
   {
-    constexpr ck_tile::index_t kBlockSize = 256;
+    constexpr ck_tile::index_t kBlockSize = 64;
Curious about the reason for decreasing the block size?
Just to keep consistent with the ck_tile code, which is well tested with performance in mind.
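For illustration, a minimal sketch of why such a constexpr block size matters (hypothetical names: dummy_fmha_bwd_kernel, run_batched_backward; the placeholder math is not the actual xformers/ck_tile code): the same constant fixes both the compile-time tiling of the kernel and its launch configuration, so it has to stay aligned with the values the ck_tile kernels were tuned for.

// Hypothetical illustration only; names and the placeholder math are not xformers code.
#include <hip/hip_runtime.h>

template <int kBlockSize>
__global__ void dummy_fmha_bwd_kernel(const float* dout, float* dq, int n) {
  // blockDim.x equals kBlockSize, so the compile-time constant and the launch agree.
  int idx = blockIdx.x * kBlockSize + threadIdx.x;
  if (idx < n) {
    dq[idx] = dout[idx];  // stand-in for the real backward computation
  }
}

template <int kBlockSize>
void run_batched_backward(const float* dout, float* dq, int n, hipStream_t stream) {
  // The same constant sets the kernel's tiling (template parameter) and the
  // workgroup size of the launch; mismatching them would break the kernel's
  // assumptions, which is why the value follows the tuned ck_tile kernels.
  dim3 block(kBlockSize);
  dim3 grid((n + kBlockSize - 1) / kBlockSize);
  hipLaunchKernelGGL((dummy_fmha_bwd_kernel<kBlockSize>), grid, block, 0, stream,
                     dout, dq, n);
}

Under that reading, the change from 256 to 64 is a tuning choice inherited from ck_tile rather than a functional change.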
@@ -9,6 +9,46 @@
 #include <ck_tile/core.hpp>
 #include <stdexcept>
 
+#ifndef FMHA_SUPPORT_MAX_HEADDIM_128
Thanks for adding this! Checking: if we want to support only head dim = 128 (to save compile time), not 64, 32, or 256, is there an easy way to configure this?
Use unset MAX_JOBS; the compiling is very fast. Even though it would be easy to build only for dim == 128, we would rather not do this, since we would not be very confident in such a build: the verification scripts provided under tests/ are not specifically prepared for dim 128. For any change in the code, we always try to run the following scripts to verify that everything is running correctly:
#> pytest tests/test_mem_eff_attention.py::test_forward
#> pytest tests/test_mem_eff_attention.py::test_backward
#> pytest tests/test_mem_eff_attention.py::test_dropout
#> pytest tests/test_mem_eff_attention.py::test_dropout_backward_ck
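Returning to the head-dim question above, here is a minimal, hypothetical sketch of how a macro such as FMHA_SUPPORT_MAX_HEADDIM_128 could gate which head-dim specializations get compiled; run_fmha_bwd_headdim and dispatch_fmha_bwd are made-up names, not the xformers generator.

// Hypothetical sketch of gating head-dim instantiations behind a preprocessor macro.
#include <cstdio>
#include <stdexcept>

// One specialization per supported head dim; in a real generator each would be a
// separately instantiated kernel to keep compile time manageable.
template <int kMaxHeadDim>
void run_fmha_bwd_headdim() {
  std::printf("dispatching FMHA backward for head_dim <= %d\n", kMaxHeadDim);
}

void dispatch_fmha_bwd(int head_dim) {
  if (head_dim <= 32) {
    run_fmha_bwd_headdim<32>();
  } else if (head_dim <= 64) {
    run_fmha_bwd_headdim<64>();
  } else if (head_dim <= 128) {
    run_fmha_bwd_headdim<128>();
#ifndef FMHA_SUPPORT_MAX_HEADDIM_128
  // This branch (and its instantiation) is compiled only when head dims above
  // 128 are enabled, i.e. when FMHA_SUPPORT_MAX_HEADDIM_128 is left undefined.
  } else if (head_dim <= 256) {
    run_fmha_bwd_headdim<256>();
#endif
  } else {
    throw std::runtime_error("unsupported head_dim for FMHA backward");
  }
}

In this sketch, defining FMHA_SUPPORT_MAX_HEADDIM_128 drops the head-dim-256 branch and its instantiation at compile time; whether the real generator works this way is an assumption here.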
tests/test_mem_eff_attention.py
@@ -1003,6 +998,38 @@ def test_dropout_backward_ck(dt, q_len, kv_len, batch_size, k, p):
     )
 
 
+@cuda_only
I think this test ended up here through a merge conflict resolution gone bad?
LGTM on the xFormers side (didn't review the generated files/generator but I trust you on that)
This PR mostly provides updates to the ROCm FMHA backward. Specifically:
To test/verify, use the pytest scripts listed earlier in this conversation (test_forward, test_backward, test_dropout, and test_dropout_backward_ck in tests/test_mem_eff_attention.py).
To benchmark performance, use:
#> python xformers/benchmark/benchmark_mem_eff_attention.py --omit-forward