Paged Attention #333

vikigenius · 2023-06-26T23:33:57Z

Just found a recent blog https://vllm.ai/ and repo https://github.com/vllm-project/vllm that implements paged attention. Tested this out and it provides massive throughput and memory efficiency improvements.

Can we implement something like this? The paper isn't out yet. But shouldn't Rust be very good at this in theory with it's memory safety guarantees.

Mellonta · 2023-06-27T00:49:19Z

Does it have any benefit on cpu-only inference, given that host memory is already paged?

okpatil4u · 2023-06-28T04:10:06Z

@vikigenius could you please share your benchmarks with vllm vs llama.cpp for gpu ? That will give us some insight into potential speed up.

vikigenius · 2023-06-28T15:13:19Z

@okpatil4u I don't have the benchmarks for llama.cpp. I primarily noticed the speed up between the PyTorch implementations with and without paged attention. And there is no reason to think an algorithmic change like that wouldn't translate across languages.

We tested it on NVIDIA A100 GPUs and got significant speedup I will try to get the numbers soon, once we have access to them again.

vikigenius · 2023-06-28T15:41:58Z

@okpatil4u got the numbers now. Not a rigorous benchmark, but should still hold up since the gains are so significant.

WIth a 40 GB A100 GPU

Inference on a vicuna-13B model without paged attention produces 20 tokens / sec
Inference on a vicuna-13B model with paged attention produces 190 tokens / sec

So the speedup is almost 10x. Obviously this is a bit skewed because our workload involves using the same initial prompt prefix in a batch inference setting so there might be good reuse of the KV cache which is helped by Paged Attention.

okpatil4u · 2023-06-28T15:51:15Z

Wow, this is amazing. Thanks for postint. But are you sure if vicuna 13b llama.cpp is benchmarking at 50 ms/token on an A40 gpu ? I would expect it to be a bit faster.

…

On Wed, 28 Jun 2023 at 9:12 PM, Vikash ***@***.***> wrote: @okpatil4u <https://github.com/okpatil4u> got the numbers now. Not a rigorous benchmark, but should still hold up since the gains are so significant. WIth a 40 GB A100 GPU Inference on a vicuna-13B model without paged attention produces 20 tokens / sec Inference on a vicuna-13B model with paged attention produces 190 tokens / sec So the speedup is almost 10x. Obviously this is a bit skewed because our workload involves using the same initial prompt prefix in a batch inference setting so there might be good reuse of the KV cache which is helped by Paged Attention. — Reply to this email directly, view it on GitHub <#333 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAXGU4CYRRIDUFY4V4ZOTNLXNRGFFANCNFSM6AAAAAAZU2FPHE> . You are receiving this because you were mentioned.Message ID: ***@***.***>

vikigenius · 2023-06-28T15:55:01Z

Well as I mentioned before we don't actually use llama.cpp at work in our A100s, so my benchmark numbers are comparing pytorch implementations.

It is possible that at this point llama.cpp itself is a bit better than the PyTorch implementation which might explain the discrepancy.

But given how big the gain is I would expect that if you port Paged Attention to llama.cpp you should see similar gains there as well ?

vikigenius · 2023-06-28T15:59:01Z

The discussion here might be relevant ggerganov/llama.cpp#1955 although it seems many people are misunderstanding how the paging works.

It should be hugely beneficial for any batched inference workloads even on a single GPU

philpax · 2023-06-28T20:59:45Z

Unfortunately, we are likely beholden to what upstream GGML supports, as this would be applied at that layer of the execution. This is something we could potentially implement with #312, but even then we'd need to work with wonnx to support this.

I'll leave this issue open for now, but I don't think we'll see much movement here from our end, sorry :/

AmineDiro · 2023-10-22T17:34:59Z

Hello,
I recently saw ggerganov PR ggerganov/llama.cpp#3228 where he implemented parallel decoding for multiple sequences. Is there any plan on supporting this feature ?
This would basically provide a mechanism for doing batch inference 🤔
Thx

philpax · 2023-10-31T22:48:23Z

Hi, that would be nice to have! I'm not sure if we'll get around to it any time soon as it'll require updating our GGML version and setting up all of the required structures, but I'll see what can be done once we get there.

philpax added issue:enhancement New feature or request topic:backend-support Support for alternate non-GGML backends, or for particular GGML backend features labels Jul 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Paged Attention #333

Paged Attention #333

vikigenius commented Jun 26, 2023

Mellonta commented Jun 27, 2023

okpatil4u commented Jun 28, 2023

vikigenius commented Jun 28, 2023

vikigenius commented Jun 28, 2023

okpatil4u commented Jun 28, 2023 via email

vikigenius commented Jun 28, 2023

vikigenius commented Jun 28, 2023

philpax commented Jun 28, 2023

AmineDiro commented Oct 22, 2023

philpax commented Oct 31, 2023

Paged Attention #333

Paged Attention #333

Comments

vikigenius commented Jun 26, 2023

Mellonta commented Jun 27, 2023

okpatil4u commented Jun 28, 2023

vikigenius commented Jun 28, 2023

vikigenius commented Jun 28, 2023

okpatil4u commented Jun 28, 2023 via email

vikigenius commented Jun 28, 2023

vikigenius commented Jun 28, 2023

philpax commented Jun 28, 2023

AmineDiro commented Oct 22, 2023

philpax commented Oct 31, 2023