Improvement in ROCM fmha-backward #1082
Merged
Commits (696 in total; changes shown from 250 commits)
1a3ce52
Change the branch for composable_kernel_tiled submodule and update to…
qianfengz f7bf9b4
Remove the using of seqlen_cpu in BwOp of ck.py
qianfengz 15d2a72
Remove the using of seqlen_cpu in BwOp of ck.py
qianfengz bcd1936
Align .clang_format with main branch and re-format c++ files
qianfengz 52ae8a3
Synchronize to latest ck-tiled commit
qianfengz af2aa86
Merge branch 'ck-tiled-fa' into develop
qianfengz 7dd3aee
Add checking of IS_CK_TILED into some testing scripts
qianfengz 5eb1235
Update to test_mem_eff_attention.py and ck.py
qianfengz dc0e67a
Merge branch 'ck-tiled-fa' into develop
qianfengz 58e6101
Building xformers using ck-tiled as default
qianfengz 1276abc
Merge branch 'ck-tiled-fa' into develop
qianfengz 389dfb4
ensure ck_decoder does not dispatch
tenpercent f8d9043
Add disable_on_rocm on some test scripts
qianfengz 78df6a9
Merge branch 'ck-tiled-fa' into develop
qianfengz 6dae63c
Update to test_mem_eff_attention.py
qianfengz a7ed88c
Merge branch 'ck-tiled-fa' into develop
qianfengz 20e178a
Merge pull request #16 from ROCm/fix_test_attn_bias_padded
qianfengz 0624c92
apply isort
tenpercent b8ebf08
apply black
tenpercent 3b33c5d
fix flake8 suggestions
tenpercent 0a9c933
add license headers and reapply black
tenpercent 47367a4
Merge pull request #17 from ROCm/linters
qianfengz fb46611
Merge pull request #10 from ROCm/enable-ci
qianfengz 28d3672
Tiny update to rocm_ci.yml
qianfengz 12fb41c
Add conditional compiling for cuda-depending codes in ROCM
qianfengz a9d83c6
Update to benchmark scripts
qianfengz 9ab3831
Rename the one script file
qianfengz 243dc6a
Revert "Add conditional compiling for cuda-depending codes in ROCM"
qianfengz 3240ba1
Update to scripts
qianfengz 0c51af1
Change and add readme for tests and benchmarks
qianfengz f36c93b
Remove the stuffs for supporting old ck
qianfengz 9e4582d
Remove old composable_kernel from submodule list
qianfengz 356cafd
Remove folder third_party/composable_kernel
qianfengz 8415b00
Merge branch 'develop' into dev_to_upstream
qianfengz 79c554c
Rename the folder
qianfengz 2be6c04
Remove unused script file
qianfengz 61d875a
apply black
tenpercent 4616121
pacify mypy
tenpercent 832e223
fix clang-format
tenpercent 2b2967e
reapply black
tenpercent 89fb7d6
Merge pull request #3 from tenpercent/lints
tenpercent 3c9d4e5
fix lints
tenpercent 1d474c5
make test_splitk_reference run on cpu
tenpercent d38a684
add ck modules to docs
tenpercent eccbf54
try fixing nvidia build by re-including sparse24 cpp folder into exte…
tenpercent 1ef6c20
update cutlass to upstream commit
tenpercent 9dfec0d
update flash-attention to upstream commit
tenpercent 9fcda18
simplify setup.py
tenpercent 01c2bfd
Merge branch 'main' of https://github.com/facebookresearch/xformers i…
tenpercent 58d38d4
remove duplicate run_batched_infer_causalmask_attnbias_dispatched<f16…
tenpercent 07183f0
add hip version and pytorch hip arch list to xformers build info
tenpercent 993a90c
fix build
tenpercent d4a374b
patch around the unhappy path in get_hip_version
tenpercent ff59f19
skip test_grad_checkpointing for triton_splitk since it doesn't have …
tenpercent 81bcfd5
re-enable test_mqa_forward since ck tiled is the current implementation
tenpercent a0f7f27
make skip test_wrong_alignment more generic
tenpercent a0d8dcc
reapply black
tenpercent bc7035c
simplify test_decoder
tenpercent f02d0d4
put python version check inside triton_splitk op
tenpercent 77a6c13
fix logic
tenpercent a7cd678
cleanup python3.9 checks in tests
tenpercent dea783d
cleanup test_attentions
tenpercent acd6b7a
cleanup test_checkpoint as test running on cpu does not depend on gpu…
tenpercent f467a1d
fix lints
tenpercent d758eac
try fixing win build by conditional import of triton in triton op
tenpercent 21f1904
re-enable test_triton_layernorm as it passes
tenpercent d880c36
re-enable test_triton_blocksparse as it passes
tenpercent 059c84f
cleanup test_sparse_tensors
tenpercent 8aa0bdc
cleanup test_custom_ops
tenpercent 5bc7bbe
reapply black
tenpercent 5b4ebe4
cleanup test_core_attention
tenpercent 473ebc7
benchmark ck ops on rocm only
tenpercent 2a7272e
Merge branch 'main' of https://github.com/facebookresearch/xformers i…
tenpercent 5d3247f
fix mypy
tenpercent 9be7f8d
Merge branch 'dev_upstream' of https://github.com/ROCm/xformers into …
tenpercent 58b0f75
fix lint: black
tenpercent 03b7294
fix lints: mypy
tenpercent 0666088
split-k decoder: move all tunable parameters to the top of cpp file
tenpercent 04eec8d
apply clang-format
tenpercent a02ab9b
Rename HDim/headdim to MaxK/maxk
qianfengz fd36725
Move some headers files to ck examples for later reusing
qianfengz 41f5ada
Merge branch 'develop' of https://github.com/ROCm/xformers into develop
qianfengz d8384c1
Replace using qs_ks_vs pipeline by qr_ks_vs pipeline while HeadDim is…
qianfengz 10346df
rm test_ck_7
tenpercent bbfe112
Merge branch 'main' of https://github.com/facebookresearch/xformers i…
tenpercent dd3f4a9
Merge branch 'main' into develop
qianfengz 08b4159
dump kernel resource usage to compilation logs similar to nv
tenpercent ce99d22
Merge branch 'facebookresearch:main' into develop
tenpercent 7637c61
Merge pull request #4 from ROCm/move-splitk-tune-params
qianfengz 2da2927
Add the c++ extension to the latest change of ck_tile/dev fwd kernel …
qianfengz 9189e45
Add the c++ extension to use ck_tile/dev/ fmha bwd kernel
qianfengz 28e713d
Update to add dropout for fmha backward
qianfengz 4ef7eba
Update in attention.cpp to align efficient_attention_backward_ck inte…
qianfengz 48a5f3e
Enable BwdOp in ck.py
qianfengz 2e45012
Support grad_out to have different strides as out
qianfengz b382f23
Merge branch 'facebookresearch:main' into develop
tenpercent 566d26f
Force seqstart_q/seqstart_k to be in device memory in ck.py
qianfengz fc6c4a6
Remove duplicated codes in ck_tiled_fmha_grouped_forward.h/infer.h
qianfengz ff0db07
Use optimized async pipeline where 8x headdim length is assumed
qianfengz 0f4a171
Fix in batched_infer
qianfengz 0d6b915
Update to track ck_tile/opt_padding_fa_train_xformers branch
qianfengz df43559
Update rocm_ci.yml
tenpercent 4713576
Update to use the newer FmhaFwdEpilogue
qianfengz 9c2f5ce
Merge branch 'facebookresearch:main' into develop
tenpercent a745c45
Update rocm_ci.yml
tenpercent 95d0260
Update rocm_ci.yml
tenpercent 4069efe
copy rocm_ci workflow from main branch
tenpercent 724354c
Update rocm_ci.yml
tenpercent b1a1009
Update to use the newer FmhaFwdEpilogue for grouped infer/forward
qianfengz 97e4e20
Temporarily disable the using of QRKSVSAsync() pipeline
qianfengz e98877a
Update rocm_ci.yml
tenpercent 6fbd05d
Implement the ck_rand_uniform interface for generating random number …
qianfengz 2ef3b3f
Add dropout to the infer path (needed by xformers test_dropout)
qianfengz 930bb25
Update to support test_dropout and test_dropout_backward tests
qianfengz bdbc956
Update the padding method in batched_backward.h
qianfengz 44fff29
Update the OGradDotO kernel padding method
qianfengz d5c2d88
Change the backward padding checking condition
qianfengz ce9c23c
Add batch_stride_lse/d parameters to adapt grouped mode forward/backw…
qianfengz dafea78
Fill the grad_bias in advance
qianfengz 06ad689
Add support for kHasBiasGrad as instance template
qianfengz bdd6291
Remove using hdim_stride_do in fmha backward
qianfengz 410f814
Force kPadSeqLenQ/kPadSeqLenK to be true in batched-backward to save …
qianfengz 2712dff
Fix missing passing of {philox_seed, philox_offset} in inference path
qianfengz 7c27a82
Use SimplifiedGenericAttentionMask to replace GenericAttentionMask
qianfengz 46c491e
Shorten the instance file names
qianfengz 4c6c08d
Rename the template parameters
qianfengz 411ccd6
Simplify the names of the dispatch class and interfaces
qianfengz 812a529
Changes to reuse the kernel files under ck_tile examples/91_tile_prog…
qianfengz 51b4223
Update test_mem_eff_attention.py for test_dropout/test_dropout_backwa…
qianfengz d10ef79
Tiny change to the philox_cuda_state input setting
qianfengz 25bd720
Allocate logsumexp to ensure aligned access by each thread-group
qianfengz abfdc27
Add checking for query/key headdim size attention_backward_generic
qianfengz ff95367
Using ck_tile/opt_padding_fa_train_pr2 and synchronize the backward c…
qianfengz 93469ab
Enable using async pipeline in the batched inference path for perform…
qianfengz 2c8626b
Re-organize cpp instances for calling fmha infer kernel
qianfengz bdd716c
Re-organize cpp instances for calling fmha forward kernel
qianfengz 44d4592
Re-organize cpp instances for calling fmha backward kernel
qianfengz 51ca91b
Position the composable_kernel_tiled to ck_tile/opt_padding_fa_train …
qianfengz 1693683
Update to synchronize with the latest commits in ck_tile/opt_padding_…
qianfengz b7aa908
update submodule to public
carlushuang 9a878d9
Merge pull request #7 from ROCm/origin/test_opt_padding_train_public
qianfengz b4fa26d
Update to the criteria for padding seqlen_k in batched infer/forward
qianfengz ee7950f
Keep latest track of ck-tile commits
qianfengz 74dfdfe
Tiny fixing to the decoder including
qianfengz 410757e
Position the ck-tiled to ck_tile/opt_padding branch
qianfengz fa155eb
Merge branch 'test_opt_padding_train' of https://github.com/ROCm/xfor…
qianfengz 77514d5
Merge branch 'develop' into test_opt_padding_train
qianfengz 92924d4
Enable some attn_bias types which were previously disabled by old-ck …
qianfengz 23f64bd
Add script generate_instances.py which helps to generate instances
qianfengz d94b2c1
Simplify logic for seqstart_q/k
xw285cornell 2486b56
Add Async pipeline to grouped mode inference path
qianfengz 18b43c9
Use explicit true for kPadSeqLenQ/kPadHeadDimQ/kPadHeadDimV templates …
qianfengz cf6cca0
Merge pull request #11 from xw285cornell/develop
qianfengz 14f7abe
Synchronize to the update of composable_kernel_tiled for better perfo…
qianfengz ee4aa87
Update rocm_ci.yml - clean up dangling images after ci run
tenpercent b0b5547
Avoid unused-const-variable warning
xw285cornell dfc196d
Tiny change in the BlockTile/Shape setting overriddings
qianfengz 2490166
Merge branch 'develop' of https://github.com/ROCm/xformers into develop
qianfengz f50861a
try to align fmha C++ extension to the ck_tile in ck develop branch
qianfengz 76fb485
Synchronize composable_kernel_tiled to latest ck develop
qianfengz 1f3add7
Use FmhaFwdTilePartitioner_HBS only with seqlen_k padded cases
qianfengz ed226f4
Merge branch 'main' into develop
qianfengz 9df93e5
Tiny fix/change to make test_forward/test_backward/test_dropout/test_…
qianfengz d6ccfa1
Fix compiling issue with regard to Invoker definitions in forward_dec…
qianfengz a7c7475
Keep using -Woverloaded-virtual
qianfengz b157b49
Fix clang-format for headers and cpp files
qianfengz b2fb213
Fix format in python scripts
qianfengz fdf8b8e
Add noqa: C801 for generate_instances.py
qianfengz 633a161
Align dispatch_bw with main branch
qianfengz 00cf683
Align ops/fmha/common.py with main branch
qianfengz 252844d
Synchronize the third-party/composable_kernel_tiled to latest ck_til…
qianfengz 610909e
Relax the atol for test_forward and test_dropout due to the using of …
qianfengz 10bf99c
Generate html report for tests run with rocm_ci.yml
tenpercent 16bb10b
archive test results when tests have failed
tenpercent 29c782b
Always clean up dangling docker images in rocm_ci
tenpercent 782d5a3
Bump python to 3.11 in rocm_ci.yml
tenpercent bd8ca1b
Disable flash attention tests rocm_ci.yml
tenpercent 77beb19
Try to fix rocm_ci.yml
tenpercent b0ae707
try to fix rocm_ci.yml flow by overriding PATH
tenpercent d2eeaf0
Fix setup.py path in rocm_ci.yml
tenpercent a62c93e
cd to xformers dir before running install in rocm_ci.yml
tenpercent d3ae25f
Use pip to install xformers in rocm_ci.yml
tenpercent d4e6abc
Possibly fix python version resolution in rocm_ci.yml
tenpercent 490b63d
Set the correct path for pytest in rocm_ci.yml
tenpercent addd2f2
remove test_reference_splitk as it was moved to a different file duri…
tenpercent 33810ff
make sure ck operators have a name to be visible in the dispatcher
tenpercent f3faa1a
fix sm version checks to happen only on CUDA, not ROCm
tenpercent 04e9481
(2/n) fix sm version checks to happen only on CUDA, not ROCm
tenpercent 9440282
Merge pull request #13 from xw285cornell/xdwang-develop
qianfengz bd49f48
Remove _check_large_shapes checking in fmha/ck.py (#1067)
qianfengz 0d1d1be
make xformers install editable to fix cpp extensions detection
tenpercent 9390d6a
Update to using the improved fmha-bwd (compiling passed)
qianfengz 22fce7e
Update to get 80% of the test_backward and test_dropout_backward_ck c…
qianfengz 463a475
Replace the using of ConvertGradQ by using torch tensor type converting
qianfengz 3427a6f
Change the tile settings for MaxK=32
qianfengz fbc7c50
Fix padding setting bug in grouped_backward
qianfengz 6e08666
Change -DCK_FMHA_FWD_FAST_EXP2=1 to -DCK_TILE_FMHA_FWD_FAST_EXP2=1
qianfengz 94ab599
Point the composable_kernel_tiled submodule to ck_tile/fa_bwd_opt branch
qianfengz 830697c
Disable flshattF and flshattB on ROCM
qianfengz afd7e02
Add -mllvm and -enable-post-misched=0 compiling options for ROCM on s…
qianfengz e67de41
Disable flshattF and flshattB on ROCM
qianfengz d72c2b3
Update to support separate grad_q_f32_strides due to the API change in…
qianfengz 5ddff31
Use old method for setting BlockDropout due to the revert in fmha_fwd…
qianfengz cf2b622
Tiny fix in grouped_backward
qianfengz 112aaed
Use packed tensor allocation for grad_q_f32
qianfengz dd83c62
Update to the ConvertGradQ kernel calling
qianfengz 3e9b99d
Tiny update
qianfengz 019448e
Fix the parameter location in grouped_backward
qianfengz c55966a
Adjust headdim128 tile shapes for better performance
qianfengz e22829a
Update backward kernel calling due to adding of nhead_stride_dk/nhead…
qianfengz cae1b77
Synchronize with CK to use separate pipeline for kPadHeadDim true of …
qianfengz e564f5e
Use convertDQ kernel
qianfengz b043765
Update to use unpadded lse layout
qianfengz c9e7595
Add explicit headdim256 instances for fmha backward
qianfengz 4a7b7dc
Add leaked headdim256 instance references
qianfengz 1ad9cbe
Change to generate.py and then re-generate the instance files using it
qianfengz 7db2aa4
Change to generate.py to generate instances references and uses the gen…
qianfengz 73dbf32
Relax the RTOL of ckFwOp from 4e-4 to 3e-3 due to one big result case
qianfengz 0e6d0c3
Change to use .h rather than .hpp as suffix for generated header files
qianfengz 914ccc5
Fix in .gitignore
qianfengz 8503f87
Update to bwd setting to use only IGLP pipeline
qianfengz bfe164d
Synchronize to latest ck_tile fix and align the headdim64 tile shape …
qianfengz f75c3b2
Reformat the generated instances cpp files
qianfengz 520e6ed
Merge pull request #18 from ROCm/fa_bwd_opt_test
qianfengz bc3db99
Fix to the backward Trait
qianfengz fa6d8b3
Set occupancy to -1 to avoid the compiling warning
qianfengz c5c7cce
Revert "Set occupancy to -1 to avoid the compiling warning"
qianfengz d230433
Add environment variable and compiler definition to control the gener…
qianfengz 82a07ae
Add --ignore-hd256 argument to generate_instance.py and some update i…
qianfengz 38593d6
Add environment variable ENABLE_HIP_FMHA_RTN_BF16_CONVERT to enable u…
qianfengz 15dc911
Remove commented lines in test_mem_eff_attention.py
qianfengz 367274c
Synchronize to latest ck_tile commit
qianfengz f7b28c5
apply black
tenpercent fd82f20
apply flake8
tenpercent 7d21800
fix mypy
tenpercent d6b6456
revert disable flash operator on rocm
tenpercent 87188ea
Synchronize to ck_tile latest commit again
qianfengz 5be80a3
Re-position the composable_kernel submodule to the develop branch
qianfengz cee0980
Merge pull request #20 from tenpercent/develop
qianfengz 2a5c141
Avoid the Async pipeline when khasBias is true
qianfengz 2874842
clang-format for two files
qianfengz cbb557d
Merge branch 'main' into upstream_pr
qianfengz 1a73f34
Change allocation of grouped mode lse from [H, M] to [1, H, M] to mat…
qianfengz 4440714
Synchronize to the upstream rocm_ci workflows
qianfengz db2b52e
Re-format tests/test_mem_eff_attention.py
qianfengz d293caf
Change in generate_instances.py so that this scripts can be called fr…
qianfengz 8eb1bbd
Merge branch 'upstream_pr' of https://github.com/ROCm/xformers into u…
qianfengz ee9640a
Add GENERATE_INSTANCES.md
qianfengz 3cf5721
clean-up commented codes
qianfengz 01cc08e
Remove un-used test
qianfengz
Submodule composable_kernel_tiled updated
526 files
I think this test got here as merge conflict resolution gone bad?