
[WIP] Dynamic length in static cache #30862

Draft
wants to merge 18 commits into base: main

Conversation

@ydshieh (Collaborator) commented May 16, 2024

What does this PR do?

The current version is a minimal change that works; it may not be the best approach.

The current static cache is nice (when running with torch.compile). However, in each generation step, the new position (to be generated) computes attention against all positions in the cache, which is not optimal. In fact, we only need to compute attention against the positions prior to the current position.

This PR implements dynamic length computation with the static cache, which works with torch.compile. The table below demonstrates the speedup (with torch.compile) of this implementation over the current main branch.
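As a rough sketch of the idea (illustrative only, not the exact PR diff; the tensor names follow the usual Llama/Gemma attention layout, and current_length plays the role of this PR's _length argument), the per-step attention only needs the first current_length entries of the static cache:

import torch
import torch.nn.functional as F

def attention_over_valid_cache(query_states, key_states, value_states, causal_mask, current_length):
    # key_states / value_states come from the static cache and have shape
    # (batch, num_heads, max_cache_len, head_dim); only the first `current_length`
    # positions hold valid entries, so slice them (and the mask) before SDPA.
    key_states = key_states[:, :, :current_length, :]
    value_states = value_states[:, :, :current_length, :]
    if causal_mask is not None:
        causal_mask = causal_mask[:, :, :, :current_length]
    return F.scaled_dot_product_attention(
        query_states, key_states, value_states, attn_mask=causal_mask
    )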

The correctness is verified by

RUN_SLOW=1 TF_FORCE_GPU_ALLOW_GROWTH=true python3 -m pytest -v tests/models/gemma/test_modeling_gemma.py -k "test_compile_static_cache"

The data below is based on this script:
import os
import torch
import datetime

from transformers import AutoTokenizer, AutoModelForCausalLM

token = "ADD_YOUR_OWN_TOKEN"

os.environ["TOKENIZERS_PARALLELISM"] = "false"

batch_size = 1
n_iter = 5

ckpt = "google/gemma-2b"

tokenizer = AutoTokenizer.from_pretrained(ckpt, token=token)
model = AutoModelForCausalLM.from_pretrained(ckpt, token=token, torch_dtype=torch.float16).to("cuda")

model.generation_config.max_new_tokens = 1024

model.generation_config.cache_implementation = "static"
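# Compile the forward pass on top of the static cache; "reduce-overhead" mode uses CUDA graphs.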
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

input_text = "Why dogs are cute."
input_ids = tokenizer([input_text] * batch_size, return_tensors="pt").to("cuda")

for i in range(n_iter):
    s = datetime.datetime.now()
    outputs = model.generate(**input_ids, do_sample=False)
    t = datetime.datetime.now()
    e = (t-s).total_seconds()
    print(e)

with some modifications to run it with different configurations, running on an A100 with torch==2.3+cu121.

Benchmark

I will re-run (part of) the benchmark, as the following numbers are on top of an older commit of main.

benchmark data on the hub

Static cache compiled: full length vs. optimal length (this PR)

gemma-2b (18 layers)

seq. length    speedup
1024           1.03x
2048           1.11x
4096           1.24x
8192           1.38x

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@@ -1218,6 +1231,7 @@ def prepare_inputs_for_generation(
"past_key_values": past_key_values,
"use_cache": use_cache,
"attention_mask": attention_mask,
"_length": int(cache_position[-1]) + 1,
Collaborator Author:

This is redundant with cache_position; however, it is the only way I could figure out to make the dynamic length computation work with torch.compile.

@gante (Member) left a comment:

TBH I also think it makes sense to slice away the useless ops; I wondered about this question myself :) The speedup of 15% is very nice, and I can confirm the speedup on my setup as well! (RTX3090) 🔥

Regarding the API (_length): I understand why it is done. Without an integer in the signature of GemmaSdpaAttention.forward, slicing tensors like key_states will always either fail due to it being a data-dependent operation (= forbidden) OR produce a variable-length tensor. With an integer, each value of the integer gets its own compiled function with data-independent tensor slicing.

Still, if we are to go forward, we should find a better solution for the API. StaticCache already introduced the cache_position input, and this would further complicate the API. I see three possible paths:

  1. cache_position becomes a list of integers instead of a tensor, we use cache_position[-1] + 1 to slice the tensors;
  2. we pass the full cache_position array (a torch.arange up to the sequence length). The different shape of cache_position in each GemmaSdpaAttention.forward will trigger recompilation, solving the dynamic shape problem
  3. instead of cache_position, we use the sequence length (=_length, an int) to control generation with static cache.

Note that in all 3 cases, the StaticCache needs a tensor like the current cache_position. However, it should be trivial to build one from any of the solutions above. From a usage perspective, option 3 is probably the easiest to understand. @ArthurZucker @ydshieh WDYT?
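(A toy illustration of the integer-specialization point above, not code from this PR or from transformers: slicing by a plain Python int keeps shapes static inside each compiled graph, whereas slicing by a value read out of a tensor would be data-dependent. Depending on the torch version, a new integer value either triggers one recompilation or switches the graph to dynamic shapes.)

import torch

@torch.compile(fullgraph=True)
def sliced_sum(cache: torch.Tensor, length: int) -> torch.Tensor:
    # `length` is a Python int, so this slice has a static shape in each specialized graph.
    return cache[:, :length].sum()

cache = torch.zeros(2, 16)
print(sliced_sum(cache, 4))  # first compilation, specialized on length=4
print(sliced_sum(cache, 8))  # one more compilation (or a dynamic-shape graph), then reused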

@ydshieh (Collaborator, Author) commented May 21, 2024

Regarding the API (_length): I understand why it is done. Without an integer in the signature of GemmaSdpaAttention.forward, slicing tensors like key_states will always either fail due to it being a data-dependent operation (= forbidden) OR produce a variable-length tensor.

Exactly!

For option 2, I am a bit worried. For example,

        if past_key_value is not None:
            # sin and cos are specific to RoPE models; cache_position needed for the static cache
            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
            key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)

Currently, cache_position has only 1 element (after the first step). If we go for option 2, it will be full length, and then we would be updating the whole cache. Of course, the key_states and value_states arguments in update are just the last part (of the sequence), so we would have to slice cache_position here too. The data-dependent operation issue thus pops up here as well.

I personally would prefer option 3 for its simplicity, as long as we can re-build the tensor (cache_position) that is required by update and the other places that need it. I would like to hear from @ArthurZucker too.

@ArthurZucker (Collaborator):

Hey, before having a look: when you mention speedups, I don't think it makes sense to measure anything that does not use the full number of layers.
Also, how many tokens are generated? Is this speedup only for the prefill phase?

@ydshieh (Collaborator, Author) commented May 23, 2024

@ArthurZucker I am running on an A10, so even with gemma-2b (18 layers), I can only compile with a 768 sequence length.
However, as you can see from the tables, more layers means more speedup, and a longer sequence means more speedup too.

Also how many tokens are generated?

from 256 to 8192 (as long as it could compile within A10 GPU memory).

The speedup gain and the reason behind it are fairly easy to see. However, are there any particular extra cases you would like me to run?

@ArthurZucker (Collaborator):

That is what was not clear to me: I wanted to know the number of generated tokens, not the prefill 😉
And most importantly, the new argument is pretty annoying 😓

@ydshieh (Collaborator, Author) commented May 23, 2024

Yes. We probably need to come up with a good new approach, as @gante suggested.
I will run full layers (18) for google/gemma-2b in the meantime.

@ydshieh ydshieh mentioned this pull request May 23, 2024
@ArthurZucker (Collaborator) left a comment:

Waiting for the final bench!

Comment on lines 575 to 581
if _length > 0:
    key_states = key_states[:, :, :_length, :]
    value_states = value_states[:, :, :_length, :]
    causal_mask = causal_mask[:, :, :, :_length] if causal_mask is not None else causal_mask
Collaborator:

This can only be an int; if it's a list, there is bound to be a device transfer.

Collaborator Author:

Yeah, so far it is an int. Let me run the final bench first and come back to the API design.

@zucchini-nlp (Member):

@ydshieh I tried to use your implementation in my PR. I am also trying to get the actual length in compiled models, but in my case the length is used to decide which rope scaling to do. Therefore, passing the length as a model kwarg fails with dynamic control flow in the fullgraph setting.

So, what do you guys think about going back to the seen_tokens attribute of the cache class? It doesn't cause control-flow errors because the cache is still a model attribute, and we update seen_tokens as we generate instead of passing it on every forward pass. And I think it will work for the kv-cropping done in this PR.

cc @gante @ArthurZucker ?

@ydshieh (Collaborator, Author) commented May 24, 2024

@ydshieh I tried to use your implementation in my PR. I am also trying to get the actual length in compiled models, but in my case length is used to decide which rope scaling to do. Therefore, passing length as model kwarg fails with dynamic control flow in fullgraph setting.

Could you point me to the lines where the length is used and where the compile issue occurs? I could take a look.

Oh, you use the length in a conditional?

@zucchini-nlp (Member):

Yes, in Phi3 RoPE it's used in a conditional, and I've been trying to compile it.
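(For context, roughly the kind of conditional involved; this is a hedged paraphrase with approximate attribute names, not Phi3's exact code:)

import torch

def select_rope_scaling_factors(length, config, device):
    # Sketch: pick long- or short-context RoPE scaling factors depending on the
    # current sequence length; the branch needs `length` to be compile-friendly.
    if length > config.original_max_position_embeddings:
        return torch.tensor(config.rope_scaling["long_factor"], dtype=torch.float32, device=device)
    return torch.tensor(config.rope_scaling["short_factor"], dtype=torch.float32, device=device)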

@ydshieh (Collaborator, Author) commented May 24, 2024

Do you still have that commit (where you integrated your PR with mine and it led to a compile failure)? If so, could you share it please 🙏

@zucchini-nlp (Member):

Sorry, I reverted your changes, but I just pushed the version that works for me, with seen_tokens. I get the length here and then use it in self.rotary_emb.

@zucchini-nlp (Member):

Update: we discussed with @ydshieh using the length in cond control flow. It does work, but only in torch 2.3.0 or 2.4.0; in 2.2.0 it fails.

So this feature will also benefit Phi3 compilation, when merged :)

@ydshieh (Collaborator, Author) commented May 27, 2024

Hi @gante

When I run with TORCH_LOGS="recompiles" on main (95b3c381) and on this PR (862cde4c), the only recompilation in both commits happens at the second call to forward (see below), which makes sense.

So this PR doesn't introduce any extra recompilation.

(If we call generate with another input of a different sequence length, there is one more recompilation. After that, everything is ready to use and there is no further recompilation, even if a 3rd input with yet another length is given.)

V0527 12:12:26.470280 140416858965824 torch/_dynamo/guards.py:1425] [__recompiles] Recompiling function forward in /transformers/src/transformers/models/gemma/modeling_gemma.py:1058
V0527 12:12:26.470280 140416858965824 torch/_dynamo/guards.py:1425] [__recompiles]     triggered by the following guard failure(s):
V0527 12:12:26.470280 140416858965824 torch/_dynamo/guards.py:1425] [__recompiles]     - tensor 'L['input_ids']' stride mismatch at index 0. expected 6, actual 7
11.347857
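(For reference, logs of this kind can be produced by prefixing a benchmark run with the environment variable, along the lines of the following; the script name here is illustrative only.)

TORCH_LOGS="recompiles" python3 benchmark_generate.py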

@ydshieh ydshieh force-pushed the dynamic_length_in_static_cache branch from 862cde4 to b447901 Compare May 27, 2024 14:34
@ydshieh (Collaborator, Author) commented May 28, 2024

it should be trivial to build from any of the solutions above

  1. cache_position becomes a list of integers instead of a tensor, we use cache_position[-1] + 1 to slice the tensors;

But we will need a tensor in StaticCache.update (that is what @ArthurZucker told me), so I don't think this option is good.

  2. we pass the full cache_position array (a torch.arange up to the sequence length). The different shape of cache_position in each GemmaSdpaAttention.forward will trigger recompilation, solving the dynamic shape problem;
  3. instead of cache_position, we use the sequence length (= _length, an int) to control generation with static cache.

A length (say _length) or a full cache_position alone is not enough to reconstruct the (current) cache_position. The problem is that we don't know whether we are in the first generation step or a later step, and therefore whether to reconstruct a full cache_position or a single (current) position to be used in StaticCache.update.

We can probably use q_len, but it is obtained from an input tensor, and I don't know if this will work well with torch.compile.

@gante Do you have any comment regarding this and something you think I could give it a try?

@gante (Member) commented Jun 14, 2024

@ydshieh sorry for the delayed response, I've now placed this issue on top of my priorities 🤗

Regarding your previous comment:

  1. cache_position becomes a list of integers instead of a tensor, we use cache_position[-1] + 1 to slice the tensors;

But we will need a tensor in StaticCache.update (that is what @ArthurZucker told me), so this option is not good I think.

I believe we can convert the cache_positions (of list type) to a tensor right before calling StaticCache.update, getting the best of both worlds 🙌 I think this is the path with minimal API changes -- assuming it works, the only needed change is the type of the input!
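A minimal sketch of that suggestion (illustrative, not the exact change; it mirrors the update call quoted earlier, assuming cache_position is kept as a plain list of ints everywhere else):

        if past_key_value is not None:
            # Materialize a tensor only at the point where StaticCache.update needs one.
            cache_position_tensor = torch.tensor(cache_position, dtype=torch.long, device=key_states.device)
            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position_tensor}
            key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)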

@ydshieh (Collaborator, Author) left a comment:

Regarding this dynamic length computation within the static cache for torch.compile: it turns out that we get even more speedup with a long (enough) cache size (so we don't need to recreate the cache object, which triggers a recompile and is very slow) when the generation is short compared to the cache length.

The following figures show that we can get as much as a 6x speedup (when the cache size is 16384 and decoding finishes early, say after 256 tokens).

Of course, the usefulness of such a long cache size is arguable, and we also need to run against real datasets and/or batch generation, but even with a cache size of 4096, a 2x speedup is still there.

p.s. A long cache size has some compilation issues, but those issues are also present on our main branch and so far are not caused only by the dynamic length in this PR.
Attached figures (3 files):
diff_cache_size_decoding_steps_4096
diff_decoding_steps_cache_size_4096
diff_decoding_steps_cache_size_16384

Comment on lines 22 to 30
class CacheInfo:
    def __init__(self, position, length):
        self.position = position
        self._length = length
Collaborator Author:

This class is introduced to encapsulate all the information (position, length) instead of passing them as separate arguments.
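A hypothetical usage sketch (the cache_info keyword and the tensor construction below are illustrative only, not part of the actual diff):

# Bundle the current write position and the valid cache length into one object
# and thread it through forward instead of separate `cache_position` / `_length` args.
cache_info = CacheInfo(
    position=torch.tensor([cache_length - 1], dtype=torch.int32, device=device),
    length=cache_length,
)
outputs = model(input_ids, cache_info=cache_info)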

@ydshieh (Collaborator, Author) commented Jun 14, 2024

I believe we can convert the cache_positions (of list type) to a tensor right before calling StaticCache.update, getting the best of both worlds 🙌

OK, I can test this approach

@ydshieh (Collaborator, Author) commented Jun 14, 2024

@gante

I tried it, but there is some slowdown in the compile timing (the first/second iteration). See the numbers below.
(There is also some slowdown after the first 2 iterations; I have to check with longer decoding steps (say 1024 - 8192) to see how much it increases.)

See the changes here: they are not optimized, just a quick attempt to try the idea.

The slowdown is likely due to the overhead of python/torch switching/conversion.

Furthermore, compared to the new CacheInfo class approach, there are more places to change (i.e. some previous torch operations have to be changed to list operations).

Let me know what you think, especially about the increased compile time.

On a T4, decoding 256 steps (without dynamic length involved, just comparing the tensor vs. list approach):

with cache_position being a tensor:

79.57995
93.045676
6.057991
6.05349
6.057576

with cache_position being a list but converted to a tensor before calling update:

108.264561
102.169996
6.601498
6.358185
6.774343

On an A100 (with 1024 decoding steps):

79.57995
93.045676
6.057991
6.05349
6.057576

vs.

108.264561
102.169996
6.601498
6.358185
6.774343

@gante (Member) commented Jun 14, 2024

@ydshieh thank you for exploring the alternative! 💛 The execution time is indeed the key metric, and it is clearly inferior (probably because it leads to more recompiled code sections).

Last counter-proposal: have you tried using _length alone, i.e. NOT passing cache_position and rebuilding it inside the attention layers' forward with torch.arange, from _length + the input shape? In terms of API it would also be better: a simple integer is preferable to a custom class :)

(Happy to try it if you're low on bandwidth!)

@ydshieh (Collaborator, Author) commented Jun 14, 2024

I could try it next week. However, FYI, cache_position is also used in several places, like:

  • prepare_inputs_for_generation: past_length
  • _update_causal_mask: causal_mask depends on it
  • GemmaModel.forward: position_ids depends on it

Also, _assisted_decoding seems to have some special logic involving cache_position:

            if "cache_position" in candidate_kwargs:
                candidate_kwargs["cache_position"] = torch.cat(
                    (
                        candidate_kwargs["cache_position"],
                        torch.arange(cur_len, cur_len + candidate_length, device=input_ids.device, dtype=torch.long),
                    ),
                    dim=0,
                )

That is why I am somewhat afraid of breaking things 😅

@gante (Member) commented Jun 14, 2024

@ydshieh I think we can solve all those cases from _length :D Assuming it works, and that the speedups are similar, I think it's well worth the effort 💪

@ydshieh ydshieh force-pushed the dynamic_length_in_static_cache branch from c0300c3 to 9168904 Compare July 4, 2024 08:05
@ydshieh ydshieh force-pushed the dynamic_length_in_static_cache branch from 0258a4e to 9ab68d0 Compare July 4, 2024 08:13
@helunwencser (Contributor) commented Jul 29, 2024

Hi @ydshieh, @gante, is there any update on this PR? I want to use Phi-3 with a static KV cache. This PR seems super useful. Can we make a similar change for Phi-3 as well?

@ArthurZucker (Collaborator):

Static cache for Phi-3 will need a separate PR to support cache positions.

@ydshieh (Collaborator, Author) commented Jul 30, 2024

@gante Let me know your thoughts on the current POC whenever you get the time to take a look. Thanks.

@guangy10 (Contributor):

@ArthurZucker @gante What is the plan to move this work/PR forward?

@ArthurZucker (Collaborator) left a comment:

Sounds promising. Not sure if @ydshieh has had time to pick this back up, but it looks good overall. Needs heavy testing though!

Comment on lines +284 to +289
if q_len > 1:
    # prefill
    cache_position = torch.arange(cache_length, dtype=torch.int32, device=hidden_states.device)
else:
    # decoding
    cache_position = torch.tensor([cache_length - 1], dtype=torch.int32, device=hidden_states.device)
Collaborator:

I am very surprised that this now works with torch.compile in reduce-overhead mode; when testing, this used to always create the same tensor (constant cache length) and the outputs were different. It would need investigation into which torch version supports this!

@ArthurZucker (Collaborator):

@guangy10 would this help for torch export?
