
Investigate PagedAttention KV-cache memory management for faster inference #1955

Closed
Azeirah opened this issue Jun 20, 2023 · 34 comments

@Azeirah
Contributor

Azeirah commented Jun 20, 2023

New research just came out on using a technique inspired by kernel virtual memory and pages to manage the KV cache.

Results? Way faster inference!

https://vllm.ai/

They claim up to 24x the throughput (measured in requests handled per second) compared to Hugging Face's Transformers library.

[Image: throughput comparison chart from the vLLM announcement]

How?

Inference is bottlenecked by memory, most notably by the KV cache. They highlight two key properties of the KV cache:

  • It is very large
  • It is dynamic: its size depends on the sequence length, which varies widely. Existing systems waste 60-80% of this memory due to fragmentation and over-reservation

PagedAttention is an alternative approach to managing the KV cache, inspired by virtual memory, pages, and blocks. Because the space is allocated dynamically in blocks, only about 4% of the memory is wasted instead of the aforementioned 60-80%.

For further details, refer to their website and GitHub.
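To make the idea concrete, here is a minimal sketch of the bookkeeping as I understand it (purely illustrative C++, not vLLM's actual code; the block size and all names are made up): the KV cache is carved into fixed-size physical blocks, and each sequence keeps a block table mapping logical token positions to physical blocks, so at most one partially filled block per sequence is wasted.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Sketch only: fixed-size KV blocks plus per-sequence block tables.
struct PagedKVCache {
    static constexpr int kBlockTokens = 16;    // tokens per block (illustrative)

    std::vector<int32_t> free_blocks;          // unused physical block indices
    std::vector<std::vector<int32_t>> tables;  // one block table per sequence

    explicit PagedKVCache(int n_blocks) {
        for (int i = n_blocks - 1; i >= 0; --i) {
            free_blocks.push_back(i);
        }
    }

    // Register a new sequence with an empty block table.
    int new_sequence() {
        tables.emplace_back();
        return (int) tables.size() - 1;
    }

    // Map a token position to its physical block, allocating on demand.
    int32_t block_for(int seq, int token_pos) {
        const int logical = token_pos / kBlockTokens;
        std::vector<int32_t> & table = tables[seq];
        while ((int) table.size() <= logical) {
            if (free_blocks.empty()) {
                throw std::runtime_error("KV cache is full");
            }
            table.push_back(free_blocks.back());
            free_blocks.pop_back();
        }
        return table[logical];
    }
};
```

The attention kernel then has to gather keys and values through the block table instead of assuming one contiguous region, which is presumably why vLLM ships a custom attention kernel.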

@JohannesGaessler
Collaborator

llama.cpp currently only ever serves one user at a time so this optimization is not applicable.

@nivibilla
Contributor

I assume it would be useful if we want to host the models and have an interface like chat.openai.com?

@JohannesGaessler
Collaborator

Yes, for enterprise use where you have one server generating responses for many users in parallel, the optimization would be useful.

@Azeirah
Contributor Author

Azeirah commented Jun 21, 2023

> llama.cpp currently only ever serves one user at a time so this optimization is not applicable.

Oh, I wasn't aware this was exclusively for a client-server application; that explains why they measure performance in requests/sec 🥲

@howard0su
Collaborator

This optimization is still applicable, as it can reduce the VRAM usage of the KV tensors.

@nivibilla
Contributor

If we do end up building this for server use (and I think that would be a good idea), then this paging system would be very useful.

@howard0su
Collaborator

howard0su commented Jun 21, 2023

I read through the blog and the code. It turns out that PagedAttention is a way to manage memory so that the compute kernel doesn't require the KV cache to be contiguous. This makes it possible for one prompt's KV blocks to be extended by multiple outputs' KV blocks, like the following:

Prompt KV Block ------ Output 1 KV Block
                ------ Output 2 KV Block
                ------ ...

This is super helpful if your prompt is long and you need to generate multiple outputs. It is purely an engineering trick; the change is mainly in how we manage the KV cache in VRAM. If we are using the CPU, this is even simpler to implement (roughly as simple as a list vs. a vector).
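A rough sketch of that sharing (again illustrative C++ only, not code from vLLM or llama.cpp): the physical KV blocks are reference-counted, forking a new output sequence from a prompt copies only the prompt's block table rather than the KV data itself, and a block is freed only once no sequence references it anymore.

```cpp
#include <cstdint>
#include <vector>

// Sketch only: share one prompt's KV blocks across several output sequences.
struct SharedKVBlocks {
    std::vector<int> ref_count;                // reference count per physical block
    std::vector<std::vector<int32_t>> tables;  // per-sequence block tables

    explicit SharedKVBlocks(int n_blocks) : ref_count(n_blocks, 0) {}

    // Fork a new sequence that reuses the parent's (e.g. the prompt's) blocks.
    // Only the table is copied; the KV data in VRAM stays where it is.
    int fork(int parent_seq) {
        tables.push_back(tables[parent_seq]);
        for (int32_t b : tables.back()) {
            ref_count[b]++;
        }
        return (int) tables.size() - 1;
    }

    // When a sequence finishes, free only the blocks nobody else references.
    void release(int seq, std::vector<int32_t> & free_blocks) {
        for (int32_t b : tables[seq]) {
            if (--ref_count[b] == 0) {
                free_blocks.push_back(b);
            }
        }
        tables[seq].clear();
    }
};
```

New tokens generated for each output then go into fresh blocks appended to that sequence's own table, which gives exactly the "one prompt block, many output blocks" picture above.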

@slaren
Collaborator

slaren commented Jun 21, 2023

We allocate all the KV memory required for the maximum context length on startup in one block, so we shouldn't have any fragmentation either.

@randxie
Contributor

randxie commented Jun 25, 2023

@JohannesGaessler Is serving multiple users concurrently or batch inference on the roadmap of llama.cpp?

@JohannesGaessler
Collaborator

I don't have any plans for it because I don't care about commercial use but I can't speak for the other devs.

@okpatil4u

okpatil4u commented Jun 25, 2023 via email

@nivibilla
Contributor

Yeah, I think first we need to solve batch inference. It's implemented in babyllama, but I haven't tried to port it over to the main llama yet.

@JohannesGaessler
Collaborator

I'm not really concerned with what other people want to use llama.cpp for. I'm implementing things that are useful for me personally first and foremost. And I don't see how I would benefit from batched inference since I only run llama.cpp for myself on my own hardware.

@nivibilla
Contributor

That's fair. Batch inference would be useful for me to use this at scale, for example if I want to do sentiment analysis on a large dataset or summarisation at scale.

@nivibilla
Contributor

And in that case, having a server to handle multiple users at the same time would help.

@vikigenius

I have a comparison of the PyTorch implementations with and without paging on a single GPU, and the gains are significant. My use case is primarily batch inference, so I am not sure about model serving.

With a 40 GB A100 GPU:

  • Inference on a Vicuna-13B model without paged attention produces 20 tokens/sec
  • Inference on a Vicuna-13B model with paged attention produces 190 tokens/sec

So the speedup is almost 10x. Obviously this is a bit skewed, because our workload uses the same initial prompt prefix in a batch inference setting, so there may be good reuse of the KV cache, which is exactly what PagedAttention helps with.

@okpatil4u

Thanks Vikash. You mentioned in another thread that there may be some misalignment in this thread in terms of understanding how vLLM works. Could you please explain what you meant by that?

Also, there have been other comments about its performance impact on CPU, GPU, and Mac M1/M2 GPU. Could you or someone else shed some light on that?

@keeganmccallum

From what I understand, this isn't so much about the multi-user/client-server use case as it is about batched inference, which does seem to be a valid use case even for single-user/local apps.

@chrfalch
Contributor

chrfalch commented Jul 7, 2023

Wouldn't the decreased memory requirement (they state that they cut memory usage by 55%) also be a benefit when running inference on smaller devices like phones and laptops?

@FNsi
Contributor

FNsi commented Jul 9, 2023

Should be useful if there's a large context.

@viktor-ferenczi

Both vLLM and lmDeploy have high-throughput batch-inference modes with various tricks. The problem is that they don't support GGUF.

How complex would it be to port those tricks (KV cache paging, dynamic batching) to llama.cpp?

@KerfuffleV2
Collaborator

#2813 - we still need to implement the non-tricky version.

Related, there's #2969 - that should also give about a 50% reduction in memory use.

@kiratp

kiratp commented Sep 11, 2023

#2813 only covers "same prompt, multiple output", not "multiple prompt, multiple output".

@henk717

henk717 commented Sep 13, 2023

I would like to voice my support for this. Over at the KoboldAI community we have had requests for multi-user support, and it would also help our Horde platform, which currently benefits from TGI's speed, but TGI has poor output quality for us compared to Llamacpp.

Having Llamacpp be fast for these use cases means multiple communities would begin using it as a general-purpose inference server, which would be a cool addition to the project (once multiple requests can be queued up).

@tikikun
Contributor

tikikun commented Sep 13, 2023

I think this feature is important to help llama.cpp usage spread even more.

@viktor-ferenczi

viktor-ferenczi commented Sep 14, 2023

Which one would be easier? Porting performance/throughput tricks into llama.cpp or porting GGUF support into vLLM?

(lmDeploy is out of the picture, since they don't want to support GGUF. They closed the feature request / suggestion ticket, since they want to concentrate on other things.)

@randxie
Contributor

randxie commented Sep 14, 2023

IMO, implementing the same idea inside llama.cpp is much better. Currently, vLLM leverages a PyTorch extension to customize the attention kernel. One benefit of llama.cpp is that it gets rid of PyTorch and is more friendly to edge deployment.

We can consider porting the kernels in vLLM into llama.cpp. It probably requires a certain amount of refactoring in llama.cpp, though.

@bobqianic
Contributor

#3479

@naik-amey

Where is the KVCacheManager implemented: on the GPU or on the host (CPU)?

@github-actions github-actions bot added the stale label Mar 25, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@64933988

Such a major optimization, and yet no one wants to integrate it!!!

@phymbert
Collaborator

> Such a major optimization, and yet no one wants to integrate it!!!

Please discuss in English here. Also, could you please elaborate on which feature, as of today, no one wants to integrate?

@K-Mistele
Contributor

Worth re-opening? The server executable can handle multiple users at a time, so it seems like this would be a really valuable thing to add.

@YanlinWangWang

> Worth re-opening? The server executable can handle multiple users at a time, so it seems like this would be a really valuable thing to add.

And it can help reduce GPU memory usage. I think it's time to start working on this.
