[Kernel] Unify the kernel used in flash attention backend #6052

LiuXiaoxuanPKU · 2024-07-02T00:45:25Z

Currently, we are using different kernels for different phases. Concretely, we use flash_attn_with_kvcache for decoding phase and flash_attn_varlen_func for prefill phase and prefix caching. For chunked prefill, we will launch both kernels to handle prefill tokens and decoding tokens separately. The current way has some drawbacks:

This complicates the attention backend logic because we need prefill_metadata and decode_metadata.
Pass more fields from the model_runner to the backend than needed because flash_attn_with_kvcache and flash_attn_varlen_func have different requirements for the input.
Use two kernels for the chunked prefill, which is not performance optimal.
Potential performance degradation because we need to build prefill_metadata and decode_metadata on the fly. But this might be minor since we cache the two metadata.

Moreover, flash_attn_with_kvcache and flash_attn_varlen_func have similar performance as they share the same underlying implementation.
Ideally, we should use a single kernel to handle all cases, including prefill phase, decoding phase, and prefix caching. For chunked prefill, we should just launch a single kernel to handle both the prefill tokens and decoding tokens.

This PR tries to simply the logic in the attention backend and use a single kernel. This is also needed for the MQA scorer (#5691) for speculative decoding.

rkooo567 · 2024-07-02T11:07:00Z

I think the direction makes sense! It is also more cuda graph friendly approach

QQ

Is this PR ready?
Original reason why I didn't try this before was that I heard the perf wasn't that different (or worse due to some optimizations for decode case). Can you share the benchmark result?

LiuXiaoxuanPKU · 2024-07-09T04:43:33Z

Yeah, the PR should be ready for review.

Some kernel benchmark numbers on a single A100, all numbers are in ms.

Number of queries tokens	Number of heads	Head dim	flash_attn_varlen_func	flash_attn_with_kvcache
100	12	64	0.0917	0.0365
500	12	64	0.379	0.383
1000	12	64	1.292	1.290
100	32	128	0.0989	0.100
500	32	128	1.550	1.549
1000	32	128	5.819	5.837
100	64	128	0.161	0.160
500	64	128	2.965	3.004
1000	64	128	11.308	11.388

Only one case we see great performance degradation

Number of queries tokens	Number of heads	Head dim	flash_attn_varlen_func	flash_attn_with_kvcache
100	12	64	0.0917	0.0365

In all other cases, the performance is quite similar.

comaniac

Overall LGTM and it's much cleaner now!
cc @rkooo567 and @cadedaniel to have a final pass.

comaniac · 2024-07-15T23:08:28Z

vllm/attention/backends/flash_attn.py

+    # Fields that are not used in flash attention backend,
+    # but used in other backends
+    context_lens_tensor: Optional[torch.Tensor] = None
+    seq_lens_tensor: Optional[torch.Tensor] = None
+    max_prefill_seq_len: Optional[int] = None
+    max_decode_seq_len: Optional[int] = None


Good finding! I'll remove them after refactoring prepare input.

comaniac · 2024-07-15T23:09:35Z

vllm/attention/backends/flash_attn.py

+        if kv_cache is None or (attn_metadata.block_tables is not None
+                                and attn_metadata.block_tables.numel()) == 0:
+            k = key
+            v = value
+            block_tables = None


This should be for pure prefill or memory profiling? Better to add comment for it.

comaniac · 2024-07-15T23:23:06Z

vllm/spec_decode/draft_model_runner.py

@@ -151,8 +151,8 @@ def execute_model(
            # Currently cuda graph is only supported by the decode phase.
            assert model_input.attn_metadata is not None
            prefill_meta = model_input.attn_metadata.prefill_metadata
-            decode_meta = model_input.attn_metadata.decode_metadata
-            if prefill_meta is None and decode_meta.use_cuda_graph:
+            if prefill_meta is None and \


Note: This code snippet is removed by #6338 so this isn't a problem anymore.

comaniac · 2024-07-15T23:23:48Z

vllm/worker/model_runner.py

@@ -655,6 +655,7 @@ def _prepare_model_input_tensors(
                input_positions.append(0)
                slot_mapping.append(_PAD_SLOT_ID)
                seq_lens.append(1)
+                query_lens.append(1)


This is used when calculating query_start_loc, which is the input for the flash attention backend when using the unified kernel.

comaniac · 2024-07-15T23:25:08Z

tests/worker/test_model_runner.py

-    for attr_expected, attr_actual in zip(vars(attn_metadata.decode_metadata),
-                                          vars(decode_meta_actual)):
-        assert attr_expected[1] == attr_actual[1]
+    if attn_metadata.prefill_metadata:


Is it always None for flash attention backend now?

rkooo567 · 2024-07-15T23:48:44Z

The review ETA is tonight!

Besides, I'd like to know the e2e performance improvement (or that it matches the performance). Is it possible to run some e2e benchmark with/without the PR and share the result?

rkooo567

(I prefer to see the e2e result before merging it! but PR looks beautiful :))

…o flash-attn-unify

jjjjohnson · 2024-07-22T02:49:23Z

Looks like the model output is chaos and totally different after unifying the kernel...
I changed the flash_attn.py to the original implementation with flash_attn_with_kvcache for decode and flash_attn_varlen_func for prefill and the result is normal.
Have you check the correctness? @LiuXiaoxuanPKU

jjjjohnson · 2024-07-22T13:09:08Z

If I add --enforce-eager, which disables cuda graph, the model output text is normal. But if I enable cuda graph, the output is totaly different. @comaniac @rkooo567
My guess is flash_attn_varlen_func ONLY works in NO cuda graph mode... But I not know why

LiuXiaoxuanPKU · 2024-07-22T13:25:47Z

Thanks for reporting, will take a look.

LiuXiaoxuanPKU · 2024-07-23T12:38:52Z

@jjjjohnson Could you provide the model/prompt you used for testing. The results seem correct for basic_correctness. Thanks!

LiuXiaoxuanPKU · 2024-07-23T13:44:16Z

@rkooo567
Some e2e performance numbers of llama-7b on a single H100 with cuda graph. All numbers are 50% percentile request latency in seconds measured with the script.

input_len	output_len	batch_size	this PR	main branch
32	128	1	0.883	0.917
32	128	2	0.877	0.912
32	128	4	0.906	0.933
32	128	8	0.946	0.956
32	128	16	1.065	1.084
32	128	32	1.236	1.259
512	32	1	0.229	0.251
512	32	2	0.237	0.259
512	32	4	0.267	0.288
512	32	8	0.326	0.347
512	32	16	0.456	0.498
512	32	32	0.699	0.779

LiuXiaoxuanPKU · 2024-07-23T20:36:45Z

I now can reproduce the bug with tests/lora/test_chatglm3.py, where if I put enforce-eager, the test can pass, otherwise the test fails. I'm wondering if your case is related to lora. I cannot reproduce the bug without lora.

rkooo567 · 2024-07-23T20:38:26Z

Hmm that's pretty odd. there's nothing lora-related in this kernel iiuc

comaniac · 2024-07-23T21:24:46Z

btw I saw a CI failure in LM Eval Small Models as follows

[2024-07-23T14:54:54Z] >               assert numpy.isclose(ground_truth, measured_value, rtol=RTOL)
[2024-07-23T14:54:54Z] E               assert False
--
  | [2024-07-23T14:54:54Z] E                +  where False = <function isclose at 0x7f72938ba070>(0.593, 0.0, rtol=0.02)
  | [2024-07-23T14:54:54Z] E                +    where <function isclose at 0x7f72938ba070> = numpy.isclose

Looks like the measured_value is 0, so the output may be garbage in this case.

jjjjohnson · 2024-07-24T06:49:52Z

I tried Qwen/Qwen-14B-Chat, without lora, can be any prompt, the result is totally different with or without enforce-eager

jjjjohnson · 2024-07-24T09:41:23Z

Looks like short prompt is OK, if you change to example_long_prompts the tests fails...

LiuXiaoxuanPKU · 2024-07-24T11:35:34Z

I tried the example_long_prompts with Qwen and it did fail. But after looking into that, it fails for both eager and non-eager mode. It also failed for other backends such as XFORMERS. Therefore, it seems like numerical issues in that case. Did you observe similar things?

LiuXiaoxuanPKU · 2024-07-24T11:47:37Z

Could you provide the exact prompt and the hardware you use? After some manual checking on H100 with Qwen/Qwen-14B-Chat. Setting enforce-eager or not give the same output. It might also be possible that bugs with cuda graph preparation are not stable. Thanks!

jjjjohnson · 2024-07-25T08:47:53Z

I use A800 TP1.
Prompt:
The rapid advancement in artificial intelligence (AI) has yielded a variety of groundbreaking technologies, among which Large Language Models (LLMs) have garnered widespread attention and utility. LLMs, such as OpenAI’s GPT-4, Google's BERT, and others, have profoundly transformed the landscape of natural language processing (NLP) over the past few years. But what exactly are LLMs, and why are they so significant?At their core, LLMs are a subset of machine learning models designed to understand.LLMs are versatile and can be fine-tuned for a variety of applications. From drafting emails and writing code to translating languages and composing poetry, the potential use cases are vast.

If I change
@pytest.mark.parametrize("backend", ["FLASH_ATTN","XFORMERS"])
to
@pytest.mark.parametrize("backend", ["XFORMERS","FLASH_ATTN"])
The tests get passes... Pretty odd...

jon-chuang · 2024-08-09T00:41:02Z

@jjjjohnson , when you say the test fails, was the output gibberish or still something reasonable? Changing the kernel may change the numerics slightly?

I think there is more likelihood to accumulate numerical error for long prompts so this checks out?

LiuXiaoxuanPKU · 2024-08-19T18:13:06Z

Updates for this PR:

We will take a less aggressive approach. We will keep the original kernels for prefill and decoding. We will use flash_attn_varlen_func for mixed batch. Mixed batch means batches with prefill tokens and decoding tokens. The goal is to enable cuda graph for chunked prefill and speculative decoding.
We need to debug the cudagraph compability for flash_attn_varlen_func kernel as it fails the unit tests.

pengwu22 · 2024-09-20T02:04:20Z

Hi @LiuXiaoxuanPKU

Based on your current test case defined in tests/kernels/test_flash_attn.py; here is a modified version: test_varlen_cg.py

It should pass the given case for mixed prefill and decode now, with vllm_flash_attn v2.6.2. python3 -m pytest test_varlen_cg.py

The major modifications are the following when use flash_attn_varlen_func with cuda graph:

We need to keep the max_query_len and max_kv_len static, to ensure CPU var takes no effect on results.
- Given the params.num_split is 0 now and we still provide page table, it will dispatch to flash_fwd_splitkv_kernel. So we need to keep i) all the kernel grid dims static, which uses the max_query_len ii) the kernel template static, which uses the max_kv_len.
Static GPU memory for g_cu_query_lens and g_cu_kv_lens and g_block_tables; and pad the rest batch index with the non-decreasing seqlens.
- Cuz their shape has a batch dimension, we need to keep them static and pad the rest. It thus requires the capture uses the largest number of query token needed. For example, we capture with [(1, 1), (1, 1), (1,1), (1, 1), (1, 1),(1, 1),(1, 1)] and can run with [[(5, 18), (1, 473), (1, 6),(0,0),(0,0),(0,0),(0,0)]]. (0,0) here are needed to keep the prepared padded cu_*_lens non-decreasing, so that GPU blocks responsible for the padded dim won't pollute the result.

Feel free to try it out. Hope it helps :)

mergify · 2024-11-26T05:51:34Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LiuXiaoxuanPKU.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

unify flash attention backend kernel

4a5f7b6

LiuXiaoxuanPKU requested review from rkooo567 and comaniac July 2, 2024 00:45

LiuXiaoxuanPKU marked this pull request as draft July 2, 2024 04:55

LiuXiaoxuanPKU changed the title ~~[Kernel] Unify the kernel used in flash attention backend~~ [WIP][Kernel] Unify the kernel used in flash attention backend Jul 2, 2024

comaniac mentioned this pull request Jul 2, 2024

[Draft][Core] Refactor _prepare_model_input_tensors #5972

Closed

LiuXiaoxuanPKU added 3 commits July 9, 2024 03:53

Merge branch 'main' into flash-attn-unify

0830ea2

fix

e449f00

fix xformer backend

037c634

LiuXiaoxuanPKU marked this pull request as ready for review July 9, 2024 04:30

fix cudagraph

66d3347

LiuXiaoxuanPKU changed the title ~~[WIP][Kernel] Unify the kernel used in flash attention backend~~ [Kernel] Unify the kernel used in flash attention backend Jul 9, 2024

LiuXiaoxuanPKU mentioned this pull request Jul 9, 2024

[WIP] [Speculative Decoding] Use MQA kernel for target model verification #5691

Closed

2 tasks

comaniac mentioned this pull request Jul 10, 2024

[Core] Refactor _prepare_model_input_tensors - take 2 #6164

Merged

LiuXiaoxuanPKU added 3 commits July 12, 2024 15:33

fix ci

b1d2b5c

Merge branch 'vllm-project:main' into flash-attn-unify

53f0adf

fix ci

2461c8a

LiuXiaoxuanPKU requested a review from cadedaniel July 15, 2024 15:22

comaniac approved these changes Jul 15, 2024

View reviewed changes

rkooo567 approved these changes Jul 15, 2024

View reviewed changes

rkooo567 added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 16, 2024

LiuXiaoxuanPKU added 3 commits July 17, 2024 19:12

Merge branch 'main' into flash-attn-unify

4225bcf

fix tests and style

1618fff

Merge branch 'flash-attn-unify' of github.com:LiuXiaoxuanPKU/vllm int…

81c221b

…o flash-attn-unify

cadedaniel mentioned this pull request Jul 22, 2024

[Core][Speculative Decoding] Add multi-query verifier for speculative decoding without batch expansion #6185

Closed

LiuXiaoxuanPKU added 2 commits July 23, 2024 05:49

merge

7b56dc5

minor

548b4d8

minor

3c16563

LiuXiaoxuanPKU added 4 commits July 25, 2024 07:27

minor

0c0f6c8

Merge branch 'main' into flash-attn-unify

953db02

add for example

7d970f9

add for example

66e832b

This was referenced Aug 13, 2024

[Feature]: Integrate flash-infer FP8 KV Cache Chunked-Prefill (Append Attention) #7450

Open

Chunked prefill support flashinfer-ai/flashinfer#392

Closed

This was referenced Aug 30, 2024

Fix a wrong reference to seqlen_k variable in the varlen kernel Dao-AILab/flash-attention#1192

Closed

Fix a wrong reference to seqlen_k variable in the varlen kernel vllm-project/flash-attention#18

Merged

simon-mo requested review from tlrmchlsmth and WoosukKwon as code owners November 26, 2024 05:49

mergify bot added the needs-rebase label Nov 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Kernel] Unify the kernel used in flash attention backend #6052

[Kernel] Unify the kernel used in flash attention backend #6052

LiuXiaoxuanPKU commented Jul 2, 2024 •

edited

Loading

rkooo567 commented Jul 2, 2024

LiuXiaoxuanPKU commented Jul 9, 2024 •

edited

Loading

comaniac left a comment •

edited

Loading

comaniac Jul 15, 2024

comaniac Jul 15, 2024

comaniac Jul 15, 2024

comaniac Jul 15, 2024

LiuXiaoxuanPKU Jul 18, 2024

comaniac Jul 15, 2024

rkooo567 commented Jul 15, 2024

rkooo567 left a comment

jjjjohnson commented Jul 22, 2024 •

edited

Loading

jjjjohnson commented Jul 22, 2024 •

edited

Loading

LiuXiaoxuanPKU commented Jul 22, 2024

LiuXiaoxuanPKU commented Jul 23, 2024

LiuXiaoxuanPKU commented Jul 23, 2024

LiuXiaoxuanPKU commented Jul 23, 2024

rkooo567 commented Jul 23, 2024

comaniac commented Jul 23, 2024

jjjjohnson commented Jul 24, 2024 •

edited

Loading

jjjjohnson commented Jul 24, 2024 •

edited

Loading

LiuXiaoxuanPKU commented Jul 24, 2024

LiuXiaoxuanPKU commented Jul 24, 2024 •

edited

Loading

jjjjohnson commented Jul 25, 2024 •

edited

Loading

jon-chuang commented Aug 9, 2024 •

edited

Loading

LiuXiaoxuanPKU commented Aug 19, 2024

pengwu22 commented Sep 20, 2024

mergify bot commented Nov 26, 2024

[Kernel] Unify the kernel used in flash attention backend #6052

Are you sure you want to change the base?

[Kernel] Unify the kernel used in flash attention backend #6052

Conversation

LiuXiaoxuanPKU commented Jul 2, 2024 • edited Loading

rkooo567 commented Jul 2, 2024

LiuXiaoxuanPKU commented Jul 9, 2024 • edited Loading

comaniac left a comment • edited Loading

Choose a reason for hiding this comment

comaniac Jul 15, 2024

Choose a reason for hiding this comment

comaniac Jul 15, 2024

Choose a reason for hiding this comment

comaniac Jul 15, 2024

Choose a reason for hiding this comment

comaniac Jul 15, 2024

Choose a reason for hiding this comment

LiuXiaoxuanPKU Jul 18, 2024

Choose a reason for hiding this comment

comaniac Jul 15, 2024

Choose a reason for hiding this comment

rkooo567 commented Jul 15, 2024

rkooo567 left a comment

Choose a reason for hiding this comment

jjjjohnson commented Jul 22, 2024 • edited Loading

jjjjohnson commented Jul 22, 2024 • edited Loading

LiuXiaoxuanPKU commented Jul 22, 2024

LiuXiaoxuanPKU commented Jul 23, 2024

LiuXiaoxuanPKU commented Jul 23, 2024

LiuXiaoxuanPKU commented Jul 23, 2024

rkooo567 commented Jul 23, 2024

comaniac commented Jul 23, 2024

jjjjohnson commented Jul 24, 2024 • edited Loading

jjjjohnson commented Jul 24, 2024 • edited Loading

LiuXiaoxuanPKU commented Jul 24, 2024

LiuXiaoxuanPKU commented Jul 24, 2024 • edited Loading

jjjjohnson commented Jul 25, 2024 • edited Loading

jon-chuang commented Aug 9, 2024 • edited Loading

LiuXiaoxuanPKU commented Aug 19, 2024

pengwu22 commented Sep 20, 2024

mergify bot commented Nov 26, 2024

LiuXiaoxuanPKU commented Jul 2, 2024 •

edited

Loading

LiuXiaoxuanPKU commented Jul 9, 2024 •

edited

Loading

comaniac left a comment •

edited

Loading

jjjjohnson commented Jul 22, 2024 •

edited

Loading

jjjjohnson commented Jul 22, 2024 •

edited

Loading

jjjjohnson commented Jul 24, 2024 •

edited

Loading

jjjjohnson commented Jul 24, 2024 •

edited

Loading

LiuXiaoxuanPKU commented Jul 24, 2024 •

edited

Loading

jjjjohnson commented Jul 25, 2024 •

edited

Loading

jon-chuang commented Aug 9, 2024 •

edited

Loading