[Kernel] Use flash-attn for decoding #3648
Conversation
Force-pushed from 04f8c75 to 0e45f5d.
We will try to get this in on a best-effort basis by tomorrow; if not, it will be slated for the next release.
@skrider Thanks for the great work! Can I directly fix this PR for faster integration?
vllm/attention/ops/flash_attn.py (outdated)
    max_subquery_len: int,
    alibi_slopes: Optional[torch.Tensor],
) -> torch.Tensor:
    raise NotImplementedError
@skrider Could you elaborate on why it is tricky to implement prefix-enabled attention?
The flash-attn API does not support attention over dense KV and paged KV in the same launch. We can either cache the dense KV before invoking forward_prefix, or compute attention separately and merge the results with the online softmax trick. Not sure which would be best.
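For reference, a minimal sketch of the second option, assuming each attention launch can also return its per-query log-sum-exp (LSE); the function below and its tensor shapes are illustrative, not part of this PR:

```python
import torch

# Merge two partial attention outputs (one over the paged KV cache, one over
# the dense, not-yet-cached KV) using the online-softmax / log-sum-exp trick.
# Assumed shapes: out_*: [num_tokens, num_heads, head_dim],
#                 lse_*: [num_tokens, num_heads].
def merge_attention_outputs(out_cache: torch.Tensor, lse_cache: torch.Tensor,
                            out_new: torch.Tensor,
                            lse_new: torch.Tensor) -> torch.Tensor:
    # Combined softmax normalizer in log space.
    lse = torch.logaddexp(lse_cache, lse_new)
    # Each partial output is rescaled by its share of the combined normalizer.
    w_cache = torch.exp(lse_cache - lse).unsqueeze(-1)
    w_new = torch.exp(lse_new - lse).unsqueeze(-1)
    return w_cache * out_cache + w_new * out_new
```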
@skrider Isn't this possible with flash_attn_varlen_func? It seems like it supports a paged KV cache (https://github.com/Dao-AILab/flash-attention/pull/831/files). IIUC, you can just pass the key cache and value cache as k and v to flash_attn_varlen_func?
Yes, this is correct; however, we need to compute attention over k_cache and k in one operation, which the kernel cannot do without first copying k and v into the KV cache.
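For illustration, a rough sketch of the "copy first" alternative: scatter the new k/v into the paged cache, then issue a single paged attention call. The cache write shown as an index_copy_ and the exact flash_attn_varlen_func signature (block_table support landed in Dao-AILab/flash-attention#831) are assumptions here, not this PR's final code:

```python
import torch
from flash_attn import flash_attn_varlen_func  # paged-KV support per PR #831


def prefix_attention(q, new_k, new_v, key_cache, value_cache, block_table,
                     slot_mapping, cu_seqlens_q, cu_seqlens_k,
                     max_seqlen_q, max_seqlen_k, scale):
    # 1) Scatter the dense (uncached) K/V into their cache slots so the cache
    #    holds the full context. vLLM uses a reshape_and_cache kernel for
    #    this; a plain index_copy_ stands in for it here.
    #    Assumed cache layout: [num_blocks, block_size, num_kv_heads, head_size];
    #    slot_mapping: int64 flat slot index per new token.
    key_cache.view(-1, *key_cache.shape[2:]).index_copy_(0, slot_mapping, new_k)
    value_cache.view(-1, *value_cache.shape[2:]).index_copy_(0, slot_mapping, new_v)

    # 2) One attention call over the paged cache, covering both the cached
    #    prefix and the just-written tokens.
    return flash_attn_varlen_func(
        q=q,
        k=key_cache,
        v=value_cache,
        cu_seqlens_q=cu_seqlens_q,
        cu_seqlens_k=cu_seqlens_k,
        max_seqlen_q=max_seqlen_q,
        max_seqlen_k=max_seqlen_k,
        softmax_scale=scale,
        causal=True,
        block_table=block_table,
    )
```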
> We can either cache the dense KV before invoking forward_prefix
@skrider This is what our current Triton kernel is doing. Can't we do the same thing with flash-attn? Please take a look at the change I made in this PR.
My mistake - we should be good then
QQ: Do we really need a new backend implementation? Why don't we just add an env var to the existing flash-attn impl's decoding path?
@rkooo567 I think that because the KV cache layout is different, it makes sense to have a separate backend.
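For context, roughly how the two layouts differ (shapes as in vLLM around this time; treat the exact dimensions as illustrative):

```python
import torch

# Example sizes only.
num_blocks, block_size, num_kv_heads, head_size, x = 1024, 16, 8, 128, 8

# Existing PagedAttention layout: keys are tiled along a vector width x,
# values are stored [head_size, block_size] per head.
pa_key_cache = torch.empty(num_blocks, num_kv_heads, head_size // x, block_size, x)
pa_value_cache = torch.empty(num_blocks, num_kv_heads, head_size, block_size)

# flash-attn paged layout: one [block_size, num_kv_heads, head_size] tile per
# block, identical for keys and values.
fa_key_cache = torch.empty(num_blocks, block_size, num_kv_heads, head_size)
fa_value_cache = torch.empty(num_blocks, block_size, num_kv_heads, head_size)
```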
@skrider I just edited this PR: 1) I removed the dependency on your FlashAttention repo (let's add it in the next PR), 2) I enabled the prefix attention, and 3) I moved this to
A unit test is needed.
vllm/worker/model_runner.py (outdated)
if self.attn_backend.get_name() == "flash-attn":
    block_table = seq_group_metadata.block_tables[seq_id]
else:
    block_table = computed_block_nums
This suggests to me that we should use some abstraction here instead of if/else branching; a method on attn_backend, perhaps.
Good point, agreed. For now, I added a note on why we need this if, and a TODO asking for a better abstraction.
I think the simplest solution here is to fix the Triton kernel so that it aligns with the other backends' APIs. I will leave this for a future PR, though.
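A hypothetical sketch of the abstraction discussed above: let each backend pick the block table it needs, so the model runner does not branch on the backend name. The class and method names are illustrative, not vLLM's actual API:

```python
from typing import List, Optional


class AttentionBackendBase:
    """Illustrative base class; not vLLM's actual interface."""

    def get_prefill_block_table(
        self,
        block_table: List[int],
        computed_block_nums: Optional[List[int]],
    ) -> List[int]:
        # Default: backends whose prefix kernel only reads the already
        # computed blocks (e.g. the Triton path) take computed_block_nums.
        return computed_block_nums or []


class FlashAttnBackend(AttentionBackendBase):
    def get_prefill_block_table(
        self,
        block_table: List[int],
        computed_block_nums: Optional[List[int]],
    ) -> List[int]:
        # flash-attn attends over the whole paged sequence, so it needs the
        # full block table rather than just the computed prefix blocks.
        return block_table
```

The model runner would then call attn_backend.get_prefill_block_table(...) unconditionally instead of checking the backend name.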
Oh yay! I will run the benchmark on an A100 today.
This reverts commit 1356df5. The Lora 3 & 4 tests seem to have an illegal memory access failure after this commit: [2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered. Example: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
…lm-project#4820): reverts commit 1356df5 (see the revert message above).
Great, thanks for the PR! @skrider It seems that this PR has introduced a constraint on supported head sizes. I've been using vLLM to test a model with a head size not in the supported list. From now on, can't I run my model on vLLM with the flash_attn option? I look forward to hearing from you.
I think if the head size is not supported, you cannot use flash-attn (it is a limitation of flash-attn).
(To make it work, you would probably have to make flash-attn support the additional head sizes.)
@rkooo567 Oh, I didn't know that flash-attn has this limitation. Thanks for the information!
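For illustration, the kind of guard this constraint implies when choosing a backend; the supported-head-size list reflects flash-attn 2.x at the time and may differ by version, and the fallback name is just an example:

```python
# Supported head sizes as of flash-attn 2.x at the time; may differ by version.
FLASH_ATTN_SUPPORTED_HEAD_SIZES = [32, 64, 96, 128, 160, 192, 224, 256]


def pick_attention_backend(head_size: int) -> str:
    if head_size in FLASH_ATTN_SUPPORTED_HEAD_SIZES:
        return "flash-attn"
    # Fall back to the existing PagedAttention/xFormers path otherwise.
    return "xformers"
```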
This PR vendors flash-attention from Dao-AILab/flash-attention#824, prunes out the backward-pass operator for faster compile times, adds a reshape-and-cache kernel for the flash-attention KV cache layout, and adds logic for selecting the KV cache manager / attention backend based on the temporary environment variable VLLM_TEMP_USE_FLASH_DECODE. Tested on a single GPU with opt-125m and llama-7b.
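A minimal sketch of how such a temporary environment-variable gate might look; the exact semantics in this PR may differ (this assumes any value other than empty or "0" enables the flash-attn decode path):

```python
import os


def use_flash_decode() -> bool:
    # Temporary opt-in flag; assumed semantics: set VLLM_TEMP_USE_FLASH_DECODE=1
    # to select the flash-attn KV cache layout and decode backend.
    value = os.environ.get("VLLM_TEMP_USE_FLASH_DECODE", "0")
    return value not in ("", "0")
```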