
Server: Add prompt processing progress endpoint? #6586

Open
stduhpf opened this issue Apr 10, 2024 · 8 comments
Labels
enhancement, help wanted, server/webui

Comments

@stduhpf
Contributor

stduhpf commented Apr 10, 2024

Feature Description

It would be nice to have an endpoint on the server example to fetch information about the progress of an ongoing prompt processing. It could return something like this:

{
    "processing": [true|false],
    "prompt_length": [number of uncached tokens of the last prompt],
    "remaining": [number of tokens yet to be processed]
}
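
A rough sketch of how a client could consume such an endpoint, assuming a hypothetical /progress path and the field names above (none of this exists in the server yet):

import time
import requests  # third-party HTTP client, assumed available

SERVER = "http://localhost:8080"  # assumed default server address

def poll_prompt_progress(interval: float = 0.5) -> None:
    """Poll the hypothetical /progress endpoint until prompt processing ends."""
    while True:
        state = requests.get(f"{SERVER}/progress").json()  # hypothetical endpoint
        if not state["processing"]:
            break
        done = state["prompt_length"] - state["remaining"]
        print(f"prompt processing: {done}/{state['prompt_length']} tokens")
        time.sleep(interval)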

Motivation

For longer prompts, or when the processing speed is very slow, it would be nice to get a clue about how far along the prompt processing is. This could also be useful for other projects, not just the server.

Possible Implementation

I haven't looked too deeply into the current server implementation yet, so I can't really tell how this would work, but I imagine it would require some deeper changes in the backend too.
I added a similar feature to a very old project based on an ancient version of llama.cpp about a year ago: stduhpf/fastLLaMa@1ebd5ba. It is very much outdated now, but the feature was nice to have.

stduhpf added the enhancement label on Apr 10, 2024
@phymbert
Collaborator

phymbert commented Apr 10, 2024

Have you looked at the /slots endpoint? I think it's all you need
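
For reference, a minimal sketch of querying /slots (assuming the default host/port; the next_token field names follow the JSON quoted further down in this thread, and the id field is assumed):

import requests  # third-party HTTP client, assumed available

# Print per-slot generation counters reported by the /slots endpoint.
for slot in requests.get("http://localhost:8080/slots").json():
    nt = slot.get("next_token", {})
    print(f"slot {slot.get('id', '?')}: n_decoded={nt.get('n_decoded')}, n_remain={nt.get('n_remain')}")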

@stduhpf
Contributor Author

stduhpf commented Apr 10, 2024

> Have you looked at the /slots endpoint? I think it's all you need

I can't get a response from the server on the /slots endpoint during prompt processing. It works during text generation and reports how many tokens are left to generate, but I would like to have that kind of response during prompt processing as well.

Maybe it's already supposed to be working during prompt processing, in which case there's probably a bug.

@phymbert
Collaborator

phymbert commented Apr 10, 2024

> Maybe it's already supposed to be working during prompt processing, in which case there's probably a bug.

It's not a bug. Prompt processing blocks the main loop during a batch iteration. You can reduce the batch size. We also have in mind to split concurrent prompt processing more fairly.

More info in:

@stduhpf
Contributor Author

stduhpf commented Apr 10, 2024

OK, so decreasing the batch size allows the server to respond on that endpoint between batches during prompt processing, but /slots still doesn't report progress during prompt processing.

@phymbert
Collaborator

> /slots still doesn't report progress during prompt processing.

Which metrics do you want to see?

@stduhpf
Contributor Author

stduhpf commented Apr 10, 2024

The current response JSON contains these metrics:

[
    {
        "next_token": {
            "n_remain": -1,
            "n_decoded": 0,
            ...
        },
        ...
    }
]

During prompt processing, these stay at their default values of -1 and 0. During token generation, they both get updated as tokens are generated, so they add up to the value of n_predict.
It would be cool to have something similar, or to re-use them during prompt processing, so that they add up to the number of tokens in the prompt being processed.
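
For illustration, a small sketch of how these counters could be turned into a progress fraction, assuming the generation-time semantics described above (n_decoded + n_remain == n_predict); the prompt-processing counterparts would be new fields:

def generation_progress(next_token: dict) -> float | None:
    """Return a 0..1 progress fraction from a slot's next_token block,
    or None while the counters are still at their defaults (-1 / 0)."""
    n_remain = next_token.get("n_remain", -1)
    n_decoded = next_token.get("n_decoded", 0)
    total = n_decoded + n_remain  # equals n_predict during generation
    if n_remain < 0 or total <= 0:
        return None
    return n_decoded / total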

@compilade
Collaborator

compilade commented Apr 11, 2024

From my understanding of batch processing, this information is not knowable (though it's possible I'm misunderstanding something). During prompt processing, the prompt is split into batches of n_batch tokens (2048 by default), and batches are further split into ubatches of n_ubatch tokens (512 by default). Each layer is then computed (sequentially) over all the tokens (in parallel) in the ubatch, so the tokens of a ubatch all "finish" processing at the same time, in a single forward pass of the compute graph.

But it might still be possible to get an estimate of the progress within a ubatch with some heuristic based on how many nodes of the compute graph have been computed, compared to the total node count of the graph, though I don't know whether that information can be extracted at all, or whether it can be done reliably for all backends. Maybe there's a way.

But if what you're asking for is progress at batch granularity, that should be easier.
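
As a rough illustration of batch-granularity progress (plain arithmetic over the splitting described above, not existing server code):

import math

def batch_progress(n_prompt: int, n_batch: int, batches_done: int) -> float:
    """Fraction of prompt tokens processed after `batches_done` full batches,
    assuming the prompt is split into ceil(n_prompt / n_batch) batches."""
    total_batches = math.ceil(n_prompt / n_batch)
    tokens_done = min(min(batches_done, total_batches) * n_batch, n_prompt)
    return tokens_done / n_prompt

# Example: a 10000-token prompt with the default n_batch of 2048 reports
# progress in 5 steps: 0.2048, 0.4096, 0.6144, 0.8192, 1.0.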

@phymbert
Collaborator

Maybe the cb_eval approach on the server can also help:

phymbert added the help wanted label on Apr 11, 2024