Some random notes:
Right now, AICI assumes a stateful interface with the LLM inference engine, where new sequences are created (forked) and the KV cache is manipulated by backtracking and fast-forwarding. As noted by @AaronFriel, Automatic Prefix Caching in vLLM (probably coming to other engines as well) might simplify this.
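
A rough sketch of the contrast, as I understand it; all names below are illustrative toys, not the actual AICI or vLLM APIs:

```python
# Stateful style: the controller drives an engine-side sequence directly,
# forking it, backtracking its KV cache, and fast-forwarding forced tokens.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sequence:
    """Toy stand-in for an engine sequence; the token list is a proxy for its KV cache."""
    tokens: List[int] = field(default_factory=list)

    def fork(self) -> "Sequence":
        # New sequence sharing the prefix (copy-on-write blocks in a real engine).
        return Sequence(tokens=list(self.tokens))

    def backtrack(self, n: int) -> None:
        # Drop the last n tokens and their KV entries.
        self.tokens = self.tokens[: max(0, len(self.tokens) - n)]

    def fast_forward(self, forced: List[int]) -> None:
        # Append controller-chosen tokens without sampling, filling KV in one batched step.
        self.tokens.extend(forced)


# Stateless style enabled by automatic prefix caching: the controller just
# resubmits the full desired token prefix each step, and the engine reuses
# whatever KV blocks it already holds for the shared prefix.
def resubmit(engine_cache: set, prefix: List[int]) -> int:
    """Return how many leading tokens were already cached (toy model)."""
    key = tuple(prefix)
    hit = max(
        (len(c) for c in engine_cache if key[: len(c)] == c),
        default=0,
    )
    engine_cache.add(key)
    return hit


if __name__ == "__main__":
    seq = Sequence(tokens=[1, 2, 3])
    branch = seq.fork()          # explore an alternative continuation
    branch.fast_forward([4, 5])  # force grammar-determined tokens
    branch.backtrack(1)          # undo the last token
    print(seq.tokens, branch.tokens)  # [1, 2, 3] [1, 2, 3, 4]

    cache = set()
    print(resubmit(cache, [1, 2, 3]))     # 0 (nothing cached yet)
    print(resubmit(cache, [1, 2, 3, 4]))  # 3 (prefix reused)
```

With prefix caching, backtracking and fast-forwarding reduce to "send a different prefix next time", at the cost of the engine re-matching the prefix on every step instead of the controller holding explicit sequence handles.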
Starting discussion thread for comments.
cc @emrekiciman @simon-mo