
Investigate PagedAttention KV-cache memory management for faster inference #1955

Closed
Azeirah opened this issue Jun 20, 2023 · 34 comments

@Azeirah
Contributor

Azeirah commented Jun 20, 2023

New research just came out on using a technique inspired by kernel virtual memory and pages to manage the KV cache.

Results? Way faster inference!

https://vllm.ai/

They claim up to 24x the throughput (measured in requests handled per second) compared to Hugging Face's Transformers library.

[Image: throughput comparison chart from the vLLM announcement]

How?

Inference is bottlenecked by memory, most notably by the KV cache. They highlight two key properties of the KV cache:

  • It is very large
  • It is dynamic: its size depends on the sequence length, which varies widely. Existing systems waste 60-80% of this memory due to fragmentation and over-reservation

PagedAttention is an alternative approach to managing the KV cache, inspired by virtual memory, pages, and blocks. Because the space is allocated dynamically in blocks, only about 4% of the memory is wasted instead of the aforementioned 60-80%.

For further details, refer to their website and GitHub.
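To make the idea concrete, here is a minimal sketch of the bookkeeping as I understand it (purely illustrative C++, not vLLM's actual code; the block size and all names are made up): the KV cache is carved into fixed-size physical blocks, and each sequence keeps a block table mapping logical token positions to physical blocks, so at most one partially filled block per sequence is wasted.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Sketch only: fixed-size KV blocks plus per-sequence block tables.
struct PagedKVCache {
    static constexpr int kBlockTokens = 16;    // tokens per block (illustrative)

    std::vector<int32_t> free_blocks;          // unused physical block indices
    std::vector<std::vector<int32_t>> tables;  // one block table per sequence

    explicit PagedKVCache(int n_blocks) {
        for (int i = n_blocks - 1; i >= 0; --i) {
            free_blocks.push_back(i);
        }
    }

    // Register a new sequence with an empty block table.
    int new_sequence() {
        tables.emplace_back();
        return (int) tables.size() - 1;
    }

    // Map a token position to its physical block, allocating on demand.
    int32_t block_for(int seq, int token_pos) {
        const int logical = token_pos / kBlockTokens;
        std::vector<int32_t> & table = tables[seq];
        while ((int) table.size() <= logical) {
            if (free_blocks.empty()) {
                throw std::runtime_error("KV cache is full");
            }
            table.push_back(free_blocks.back());
            free_blocks.pop_back();
        }
        return table[logical];
    }
};
```

The attention kernel then has to gather keys and values through the block table instead of assuming one contiguous region, which is presumably why vLLM ships a custom attention kernel.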

@JohannesGaessler
Collaborator

llama.cpp currently only ever serves one user at a time so this optimization is not applicable.

@nivibilla
Contributor

I assume it would be useful if we want to host the models and have an interface like chat.openai.com?

@JohannesGaessler
Collaborator

Yes, for enterprise use where you have one server generating responses for many users in parallel, the optimization would be useful.

@Azeirah
Contributor Author

Azeirah commented Jun 21, 2023

> llama.cpp currently only ever serves one user at a time so this optimization is not applicable.

Oh, I wasn't aware this was exclusively for a client-server application; that explains why they measure performance in requests/sec 🥲

@howard0su
Collaborator

This optimization is still applicable, as it can reduce the VRAM usage of the KV tensors.

@nivibilla
Contributor

If we do end up building this for server use (and I think that would be a good idea), then this paging system would be very useful.

@howard0su
Collaborator

howard0su commented Jun 21, 2023

I read through the blog and the code. It turns out that PagedAttention is a way to manage memory so that the compute kernel doesn't require the KV cache to be contiguous. This makes it possible for one prompt's KV blocks to be extended by multiple outputs' KV blocks, like the following:

Prompt KV Block ------ Output 1 KV Block
                ------ Output 2 KV Block
                ------ ...

This is super helpful if your prompt is long and you need to generate multiple outputs. It is purely an engineering trick; the change is mainly in how we manage the KV cache in VRAM. If we are using the CPU, this is even simpler to implement (roughly as simple as a list vs. a vector).
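A rough sketch of that sharing (again illustrative C++ only, not code from vLLM or llama.cpp): the physical KV blocks are reference-counted, forking a new output sequence from a prompt copies only the prompt's block table rather than the KV data itself, and a block is freed only once no sequence references it anymore.

```cpp
#include <cstdint>
#include <vector>

// Sketch only: share one prompt's KV blocks across several output sequences.
struct SharedKVBlocks {
    std::vector<int> ref_count;                // reference count per physical block
    std::vector<std::vector<int32_t>> tables;  // per-sequence block tables

    explicit SharedKVBlocks(int n_blocks) : ref_count(n_blocks, 0) {}

    // Fork a new sequence that reuses the parent's (e.g. the prompt's) blocks.
    // Only the table is copied; the KV data in VRAM stays where it is.
    int fork(int parent_seq) {
        tables.push_back(tables[parent_seq]);
        for (int32_t b : tables.back()) {
            ref_count[b]++;
        }
        return (int) tables.size() - 1;
    }

    // When a sequence finishes, free only the blocks nobody else references.
    void release(int seq, std::vector<int32_t> & free_blocks) {
        for (int32_t b : tables[seq]) {
            if (--ref_count[b] == 0) {
                free_blocks.push_back(b);
            }
        }
        tables[seq].clear();
    }
};
```

New tokens generated for each output then go into fresh blocks appended to that sequence's own table, which gives exactly the "one prompt block, many output blocks" picture above.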

@slaren
Collaborator

slaren commented Jun 21, 2023

We allocate all the KV memory required for the maximum context length on startup in one block, so we shouldn't have any fragmentation either.

@randxie
Contributor

randxie commented Jun 25, 2023

@JohannesGaessler Is serving multiple users concurrently or batch inference on the roadmap of llama.cpp?

@JohannesGaessler
Collaborator

I don't have any plans for it because I don't care about commercial use but I can't speak for the other devs.

@okpatil4u

okpatil4u commented Jun 25, 2023 via email

@nivibilla
Contributor

Yeah, I think first we need to solve batch inference. It's implemented in babyllama, but I haven't tried to port it over to the main llama yet.

@JohannesGaessler
Collaborator

I'm not really concerned with what other people want to use llama.cpp for. I'm implementing things that are useful for me personally first and foremost. And I don't see how I would benefit from batched inference since I only run llama.cpp for myself on my own hardware.

@nivibilla
Contributor

That's fair. Batch inference would be useful for me to use this at scale, for example if I want to do sentiment analysis on a large dataset or summarisation at scale.

@nivibilla
Contributor

And in that case, having a server to handle multiple users at the same time would help.

@vikigenius

I have a comparison of the PyTorch implementations with and without paging on a single GPU, and the gains are significant. My use case is primarily batch inference, so I am not sure about model serving.

With a 40 GB A100 GPU:

  • Inference on a Vicuna-13B model without paged attention produces 20 tokens/sec
  • Inference on a Vicuna-13B model with paged attention produces 190 tokens/sec

So the speedup is almost 10x. Obviously this is a bit skewed, because our workload uses the same initial prompt prefix in a batch inference setting, so there may be good reuse of the KV cache, which is exactly what PagedAttention helps with.

@okpatil4u

Thanks Vikash. You mentioned in another thread that there may be some misalignment in this thread in terms of understanding how vLLM works. Could you please explain what you meant by that?

Also, there have been other comments about its performance impact on CPU, GPU, and Mac M1/M2 GPU. Could you or someone else shed some light on that?

@keeganmccallum

From what I understand, this isn't so much about the multi-user/client-server use case as it is about batched inference, which does seem to be a valid use case even for single-user/local apps.

@chrfalch
Contributor

chrfalch commented Jul 7, 2023

Wouldn't the decreased memory requirement (they state that they cut memory usage by 55%) also be a benefit when running inference on smaller devices like phones and laptops?

@FNsi
Contributor

FNsi commented Jul 9, 2023

Should be useful if there's a large context.

@viktor-ferenczi

Both vLLM and lmDeploy have high-throughput batch-inference modes with various tricks. The problem is that they don't support GGUF.

How complex would it be to port those tricks (KV cache paging, dynamic batching) to llama.cpp?

@KerfuffleV2
Collaborator

#2813 - we still need to implement the non-tricky version.

Related, there's #2969 - that should also give about a 50% reduction in memory use.

@kiratp

kiratp commented Sep 11, 2023

#2813 only covers "same prompt, multiple output", not "multiple prompt, multiple output".

@henk717

henk717 commented Sep 13, 2023

I would like to voice my support for this. Over at the KoboldAI community we have had requests for multi-user support, and it would also help our Horde platform, which currently benefits from TGI's speed, but TGI has poor output quality for us compared to Llamacpp.

Having Llamacpp be fast for these use cases means multiple communities would begin using it as a general-purpose inference server, which would be a cool addition to the project (once multiple requests can be queued up).

@tikikun
Contributor

tikikun commented Sep 13, 2023

I think this feature is important to help llama.cpp usage spread even more.

@viktor-ferenczi

viktor-ferenczi commented Sep 14, 2023

Which one would be easier? Porting performance/throughput tricks into llama.cpp or porting GGUF support into vLLM?

(lmDeploy is out of the picture, since they don't want to support GGUF. They closed the feature request / suggestion ticket, since they want to concentrate on other things.)

@randxie
Contributor

randxie commented Sep 14, 2023

IMO, implementing the same idea inside llama.cpp is much better. Currently, vLLM leverages a PyTorch extension to customize the attention kernel. One benefit of llama.cpp is that it gets rid of PyTorch and is more friendly to edge deployment.

We can consider porting the kernels in vLLM into llama.cpp. It probably requires a certain amount of refactoring in llama.cpp, though.

@bobqianic
Contributor

#3479

@naik-amey

Where is the KVCacheManager implemented: on the GPU or on the host (CPU)?

@github-actions github-actions bot added the stale label Mar 25, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@64933988

Such a major optimization, and yet no one wants to integrate it!!!

@phymbert
Collaborator

> Such a major optimization, and yet no one wants to integrate it!!!

Please discuss in English here. Also, could you please elaborate on which feature, as of today, no one wants to integrate?

@K-Mistele
Contributor

Worth re-opening? The server executable can handle multiple users at a time, so it seems like this would be a really valuable thing to add.

@YanlinWangWang

> Worth re-opening? The server executable can handle multiple users at a time, so it seems like this would be a really valuable thing to add.

And it can help reduce GPU memory usage. I think it's time to start working on this.
