
Change workspace size in fmha backward #1028

Conversation

@alihassanijr (Contributor) commented Apr 16, 2024

What does this PR do?

TL;DR: I think the FMHA backward kernel uses more scratch memory than it needs, beyond the padding required by 128-bit alignment and tile sizes.

FMHA backward's scratch space for gK and gV is set up to be `num_k_splits * align_up(num_keys, kBlockSizeJ) * align_up(dim, kBlockSizeI)`, repeated over batch and heads.

Given that each CTA computes unique tiles of gK and gV, this means that for every gK/gV tile, `align_up(num_keys, kBlockSizeJ) * align_up(dim, kBlockSizeI)` accum elements are reserved.
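To make the arithmetic concrete, here is a minimal sketch of that sizing. The names (`align_up`, `gkv_scratch_before`, the tile constants) mirror the description above rather than the actual xFormers source, which differs in naming and layout:

```cpp
#include <cstdint>

// Round x up to the nearest multiple of `alignment`; mirrors the
// align_up used in the description above (hypothetical helper).
constexpr int64_t align_up(int64_t x, int64_t alignment) {
  return ((x + alignment - 1) / alignment) * alignment;
}

// Scratch elements reserved per (batch, head) for one of gK/gV before
// this PR: a full num_keys-sized plane per key split, even though each
// CTA only ever writes a single kBlockSizeJ-row tile of it.
constexpr int64_t gkv_scratch_before(int64_t num_k_splits, int64_t num_keys,
                                     int64_t dim, int64_t kBlockSizeJ,
                                     int64_t kBlockSizeI) {
  return num_k_splits * align_up(num_keys, kBlockSizeJ) *
         align_up(dim, kBlockSizeI);
}
```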

I might be totally off, but my understanding is that the gK and gV accumulator pointers aren't even offset by anything relating to `key_start`, which means the same `kBlockSizeJ` rows will be reused over and over.

This means that `align_up(num_keys, kBlockSizeJ)` can be replaced with just `kBlockSizeJ`.
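Under that reading, the fix is the one-term change below (continuing the hypothetical names from the sketch above); the worked numbers illustrate the scale of the saving:

```cpp
// After the fix: one kBlockSizeJ-row tile per key split is enough, since
// the accumulator pointer is never offset by key_start anyway.
constexpr int64_t gkv_scratch_after(int64_t num_k_splits, int64_t dim,
                                    int64_t kBlockSizeJ, int64_t kBlockSizeI) {
  return num_k_splits * kBlockSizeJ * align_up(dim, kBlockSizeI);
}

// Example: with num_keys = 4096 and kBlockSizeJ = 64, the gK/gV scratch
// shrinks by align_up(4096, 64) / 64 = 64x per (batch, head),
// independent of dim.
static_assert(gkv_scratch_before(1, 4096, 128, 64, 32) ==
                  64 * gkv_scratch_after(1, 128, 64, 32),
              "scratch shrinks by num_keys / kBlockSizeJ");
```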

My own use case works fine and passes a memcheck with this change. All unit tests passed for me locally (excluding the recent `torch.compile` test in `test_mem_eff_attention`; I think I need to be on torch nightly?).

Kernel arch tags tested:

  • SM50
    • SM61
  • SM70
  • SM75
  • SM80
    • SM80
    • SM86

Before submitting

  • Did you have fun?
    • Make sure you had fun coding 🙃
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue? (no need for typos, doc improvements)
    • N/A
  • Did you make sure to update the docs?
    • N/A
  • Did you write any new necessary tests?
    • N/A
  • Did you update the changelog? (if needed)
    • N/A

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

@codecov-commenter
Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 59.92%. Comparing base (5d59023) to head (071b1f0).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1028   +/-   ##
=======================================
  Coverage   59.92%   59.92%           
=======================================
  Files         113      113           
  Lines       10007    10007           
=======================================
  Hits         5997     5997           
  Misses       4010     4010           
Flag     Coverage     Δ
Python   59.92% <ø>   (ø)


@danthe3rd (Contributor) left a comment

Thanks! That's a nice catch :o
Indeed, this over-allocated memory is never used in the kernel. Let me triple-check by running some internal tests, and I'll merge this :)
cc @drisspg

@danthe3rd

All tests pass - merging
Thanks a lot for spotting and submitting a fix!

@danthe3rd danthe3rd merged commit f663712 into facebookresearch:main Apr 16, 2024
8 of 9 checks passed
@alihassanijr alihassanijr deleted the fmha-backward-gK-gV-workspace-size branch April 16, 2024 15:30
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Jun 5, 2024
Backporting a few fixes from xFormers:
* Bug fixes for local attention (which is not exposed in PT at the moment)
* Massively reduced memory usage on the BW pass (see also facebookresearch/xformers#1028)

Essentially, this will also make the xFormers build process much easier, as we will be able to use mem-eff from PyTorch (if the user has a recent enough version) rather than building it at xFormers install time.
The goal is to have the source of truth for these files in PT moving forward, and remove them from xFormers eventually once our users have a recent-enough version of PT.
Pull Request resolved: #127090
Approved by: https://github.com/drisspg