
[Feature]: Online Inference on local model with OpenAI Python SDK #8631

Open
1 task done
pesc101 opened this issue Sep 19, 2024 · 8 comments

pesc101 commented Sep 19, 2024

🚀 The feature, motivation and pitch

OpenAI recently introduced a new batch inference endpoint (https://platform.openai.com/docs/guides/batch/overview?lang=curl). It would be nice if vLLM supported the same batch format, but with a local model.
I created a usage issue for this before (#8567).

Something like this:

from openai import OpenAI

# Point the OpenAI client at the locally running vLLM server
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

# Upload the batch input file (OpenAI batch JSONL format)
batch_input_file = client.files.create(
    file=open("batchinput.jsonl", "rb"),
    purpose="batch",
)

# Create the batch job against the chat completions endpoint
client.batches.create(
    input_file_id=batch_input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
    metadata={
        "description": "nightly eval job"
    },
)
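
For reference, each line of batchinput.jsonl follows OpenAI's batch request format; a single request looks roughly like this (the model name is just a placeholder):

{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}}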

At the moment, this request fails with an error:
NotFoundError: Error code: 404 - {'detail': 'Not Found'}

Advantages of this implementation:

  • vLLM can run as a Docker container and act purely as an endpoint
  • It is compatible with the OpenAI Python SDK, making it easier for newcomers to use, and the model can easily be switched from the OpenAI servers to local models
  • Consistent workflow if you already use the Docker container for chat

Alternatives

Internal implementation:
There is an existing offline feature, invoked via python -m vllm.entrypoints.openai.run_batch as described in #4777, but it is not compatible with the OpenAI SDK and also does not fit the Docker setup.
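
For context, that offline batch runner is invoked roughly like this (the flags and model name are examples; check the vLLM docs for your version):

python -m vllm.entrypoints.openai.run_batch \
    -i batchinput.jsonl \
    -o results.jsonl \
    --model meta-llama/Meta-Llama-3-8B-Instruct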

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

pesc101 commented Sep 19, 2024

Seems to me there are some bots with suspect links here 👀


DarkLight1337 commented Sep 19, 2024

cc @wuisawesome @pooyadavoodi since you two have worked on batch API

@wuisawesome
Contributor

but that is not compatible with the OpenAI SDK and also not compatible with the docker setup

Out of curiosity, can you say more about your Docker setup? Would it unblock you to mount the directory with the data into your Docker container?

For what it's worth, the reason I didn't implement this API originally was that I couldn't think of a way to implement the job management without either introducing foot-guns or adding the first stateful endpoint.

This is not the first time we've heard this request though, and it is probably worth thinking about more if it becomes a recurring theme.


pesc101 commented Sep 25, 2024

Hey,
you can find my Docker setup here: #8567.
I have mounted the directory into the container.

Okay, I see. I don't know exactly how to implement it, but I think it would improve the usability of vLLM in general.
A major advantage of vLLM is fast batch inference, and it would be nice if that were compatible with the OpenAI SDK.

@github-staff github-staff deleted a comment from AmjedKhaled165 Sep 25, 2024
@mbuet2ner

Totally agree that this is an interesting feature. It would be super nice to have something standardized here (I mentioned something similar here).
We are currently leveraging Ray and the LLM class's llm.chat() for that, really similar to the very simple generate() example from the docs. v0.6.2 even brought support for batch inference in llm.chat().

Our current approach is as follows:

  • We built a FastAPI application that accepts batch.jsonl files and writes them to blob storage. You can find the official OpenAI API specification here and can use the OpenAI Pydantic models from the SDK (which are auto-generated from the API spec) to build your endpoints and validate the data. A rough sketch of the upload endpoint follows this list.
  • The API then leverages Ray via the LLM class. llm.chat() has a slightly different interface, but the messages format and the sampling parameters are more or less identical to the OpenAI format. You can then use the Pydantic models from the existing batch entrypoint to parse the JSONL files, extract the messages, sampling parameters, etc., and pass them to llm.chat(). After that you can take the RequestOutput from llm.chat() and iteratively build the BatchRequestOutput and the intermediate OpenAI/vLLM-adapted Pydantic models.
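
A rough sketch of that upload endpoint (the /v1/files path and the response shape mimic the OpenAI Files API as I understand it; the blob-storage write is stubbed out, so treat the details as assumptions):

import uuid

from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()

@app.post("/v1/files")
async def upload_batch_file(file: UploadFile = File(...), purpose: str = Form(...)):
    content = await file.read()
    file_id = f"file-{uuid.uuid4().hex}"
    # TODO: persist `content` to blob storage under file_id instead of discarding it
    return {
        "id": file_id,
        "object": "file",
        "bytes": len(content),
        "filename": file.filename,
        "purpose": purpose,
    }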

I can share some WIP code on how to parse the JSONL files with Ray, load them as BatchRequestInput, and format the llm.chat() output as BatchRequestOutput if you want; a rough sketch of that flow is below. It is a little complicated due to the different interfaces, but it works!
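
A minimal sketch of that parse-and-chat flow (not the actual WIP code; the field names on BatchRequestInput and the llm.chat() signature are written from memory, so double-check them against your vLLM version, and the model name is just an example):

import json

from vllm import LLM, SamplingParams
from vllm.entrypoints.openai.protocol import BatchRequestInput

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

# Parse the OpenAI batch JSONL into the existing Pydantic models
requests = [
    BatchRequestInput.model_validate_json(line)
    for line in open("batchinput.jsonl")
    if line.strip()
]

# Extract the messages and sampling parameters and hand them to llm.chat()
conversations = [req.body.messages for req in requests]
sampling_params = [
    SamplingParams(
        temperature=req.body.temperature or 1.0,
        max_tokens=req.body.max_tokens or 256,
    )
    for req in requests
]
outputs = llm.chat(conversations, sampling_params)

# Reassemble something shaped like the batch output format
for req, out in zip(requests, outputs):
    print(json.dumps({"custom_id": req.custom_id, "response": out.outputs[0].text}))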


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale label Dec 31, 2024
@wuisawesome
Contributor

boop

@github-actions github-actions bot added unstale and removed stale labels Jan 9, 2025
@RyanMarten

@wuisawesome @pooyadavoodi I'm also interested in seeing this feature implemented! Having an online OpenAI Batch API endpoint on top of vLLM would be super cool.

Is there any WIP PR to check out?
