
feat: OpenAI Compatible Frontend #7561

Merged: 81 commits merged into main on Oct 11, 2024

Conversation

rmccorm4 (Contributor) commented Aug 22, 2024

Description

Adds an OpenAI Compatible Frontend for Triton Inference Server as a FastAPI application using the tritonserver in-process python bindings for the following endpoints:

  • /v1/models
  • /v1/completions
  • /v1/chat/completions

Additionally there are some other observability routes exposed for convenience:

  • /metrics - Prometheus-compatible metrics from Triton Core
  • /health/ready - General health check for inference readiness, similar to the NIM schema.
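
As a quick illustration of the drop-in compatibility, here is a minimal usage sketch for the /v1/chat/completions route using the openai Python client; the base URL and model name are placeholders that depend on how the frontend is started, not values fixed by this PR:

from openai import OpenAI

# Point the client at the Triton OpenAI-compatible frontend instead of api.openai.com.
# The base URL and model name below are placeholders for this sketch.
client = OpenAI(base_url="http://localhost:9000/v1", api_key="unused")

completion = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What is Triton Inference Server?"}],
    max_tokens=64,
)
print(completion.choices[0].message.content)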

This is a refactor and extension of the original example by @nnshah1 in triton-inference-server/tutorials#104, with more thorough testing added.

Refactors/Changes

  1. The file structure was refactored to make each logical component smaller, more discoverable, and more digestible. It was loosely based on a few external resources.

  2. The global variables were converted to use the FastAPI app object for storing state instead, such as app.server to access the in-process tritonserver across the different request-handling routes (a sketch of this pattern follows the list).

  3. Encapsulated info related to model objects in a ModelMetadata dataclass.

  4. Made the --tokenizer an explicit setting provided by the user at startup, rather than something we try to infer based on the Triton model name used in the model repository. This is more resilient to BYO/custom models and models not produced by the Triton CLI.

  5. Added an optional explicit --backend (request conversion format) override to better deal with edge cases like LLMs defined in custom backends, or ambiguous backends like "python" for ensemble/BLS models where the request format isn't immediately clear from the backend choice. If the explicit override is not used, the frontend currently picks the vllm request format when the target model's backend is vllm, and the tensorrtllm format for any other backend value (see the selection sketch after this list).

  6. Extracted most of the Triton in-process Python API logic out of the FastAPI route definitions. The pre-initialized in-process tritonserver is attached to the FastAPI app during initialization in main.py. This will make it simpler to mock and unit test specific parts in the frontend, as well as teardown/startup the frontend during testing without stopping/starting the tritonserver each time.

  7. Updated various parts of the generated schemas/openai.py that appeared to have been generated for pydantic v1 (we currently use v2), to address the deprecation warnings for patterns that still work but will be removed in coming versions of pydantic (an example follows this list): https://docs.pydantic.dev/latest/migration/
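
The sketches below illustrate a few of the refactors above. They are hedged examples rather than excerpts of the PR code; the tritonserver constructor arguments, function names, and schema fields shown are assumptions.

A minimal sketch of item 2 (storing state on the FastAPI app object instead of globals), assuming the in-process Python API exposes Server(...).start() and Server.ready():

import tritonserver
from fastapi import FastAPI, Request

def init_app(model_repository: str) -> FastAPI:
    app = FastAPI()
    # Attach the pre-initialized in-process server to the app object so that
    # route handlers can reach it without module-level globals.
    app.server = tritonserver.Server(model_repository=model_repository).start()
    return app

app = init_app("/workspace/model_repository")

@app.get("/health/ready")
def ready(request: Request) -> dict:
    # Handlers read shared state from the app, e.g. request.app.server.
    return {"ready": request.app.server.ready()}

A condensed restatement of the request-format selection rule from item 5 (function name and signature are illustrative):

def determine_request_format(model_backend: str, override: str | None = None) -> str:
    # The explicit --backend override wins, e.g. for custom backends or a
    # "python" backend (ensemble/BLS) where the format can't be inferred.
    if override:
        return override
    # Otherwise: vllm request format for vllm models, tensorrtllm for the rest.
    return "vllm" if model_backend == "vllm" else "tensorrtllm"

And an illustrative before/after for the pydantic v2 updates in item 7 (the field shown is an example, not the exact generated schema):

from pydantic import BaseModel, ConfigDict, Field

# pydantic v1 style emitted by the original generator (now deprecated):
#   class CreateCompletionRequest(BaseModel):
#       max_tokens: int = Field(16)
#       class Config:
#           allow_population_by_field_name = True

# pydantic v2 style after the update:
class CreateCompletionRequest(BaseModel):
    model_config = ConfigDict(populate_by_name=True)
    max_tokens: int = Field(16)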

Testing Methodology

  1. The original set of tests was added using FastAPI's TestClient for simplicity in getting started. I later found out that this TestClient object, despite interacting with the application and logic as normal, does not actually expose the network host/ports for interaction from other clients. It would probably be good to move away from TestClient and use the OpenAIServer utility, described below, for all testing later on.
  2. OpenAI client tests were then added to show how this implementation can act as a drop-in replacement for current users of the openai python client library, which required an OpenAIServer utility class to run the application differently than the TestClient flow expects, similar to vLLM's utility here.
  3. Streaming tests were done through the OpenAI client, as the documentation around streaming with TestClient was sparse (a streaming sketch follows this list).
  4. Overall the tests were meant to add broad support for various types of clients (raw http requests, openai client, genai-perf) and broad testing for the supported parameters (test that changes in the openai-facing values produce different behavior in the backends). Deep dives into each and every parameter and exactly how each value would behave (e.g., top_p, top_k) were not done; I would expect more of that to be tested by each backend implementation.
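
As referenced in item 3, a hedged sketch of how a streaming test exercises the frontend through the openai client; the base URL, model name, and assertion are illustrative rather than copied from the test suite:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="unused")

# Request a streamed chat completion and reassemble the delta chunks.
stream = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
    max_tokens=16,
)
text = "".join(chunk.choices[0].delta.content or "" for chunk in stream)
assert text  # streamed deltas should reassemble into a non-empty reply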

Open Questions

  • Q: Should /v1/models list all models in the model repository, even if only some of those models are actually compatible with the completions / chat endpoints? It may be difficult to automatically infer which models are "compatible" and account for all scenarios, so it may need to be made explicit by the user if we want to limit which models are returned.
    • A: Let's expose a SERVED_MODEL_NAME for a cosmetic mapping from something like tensorrt_llm_bls -> meta-llama/Meta-Llama-3.1-8b-Instruct to let the server/app starter specify how clients should interact with it. They may want to hide the details of ensemble/BLS/etc. to clients behind a more convenient --served-model-name.
  • Q: What should we do with the docker/Dockerfile* files and the general expected user-facing workflow? I added these as examples for myself to use during testing/development and included how they'd be used in the README, but ultimately we will probably be publishing this code within the respective containers, and these DIY containers are likely unnecessary.
    • A: These will be likely removed as we get closer to having something published in a container.
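
A tiny sketch of the cosmetic mapping discussed in the first question above; the structure and names are hypothetical, since the actual --served-model-name support lands in a follow-up (DLIS-7171):

# Hypothetical served-model-name mapping: clients request the friendly name,
# and the frontend resolves it to the underlying Triton model before inference.
served_model_names = {
    "meta-llama/Meta-Llama-3.1-8b-Instruct": "tensorrt_llm_bls",
}

def resolve_model(requested_name: str) -> str:
    return served_model_names.get(requested_name, requested_name)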

Open Items for this PR

  • Initial benchmarking and performance improvements, especially under high concurrency and load with vllm==0.5.3.post1
  • KServe frontend bindings integration
  • Fix issues with parameters like stop that are passed as a List when using vLLM backend.

Open Items for follow-up PRs

  • Further benchmarking and performance improvements, especially with vllm >= 0.6
  • README and Documentation improvements (DLIS-7179)
  • Support SERVED_MODEL_NAME mapping for "convenience name" -> "triton model name" (DLIS-7171)
  • Add some tests that use genai-perf to maintain compatibility and catch regressions
  • Unsupported OpenAI schema parameters like logit_bias, logprobs, n > 1, best_of. These are mostly just unexplored, and haven't yet been scoped out to see the effort involved in supporting them. (DLIS-7183, DLIS-7184)
  • Function Calling / Tools support (DLIS-7168)
  • Unsupported OpenAI schema response fields like usage (DLIS-7185)
  • Testing for LoRA/multi-LoRA (DLIS-7186)
  • Migrating TestClient tests to use the OpenAIServer utility for all testing instead

Notes

  • I left in all the commits so you can watch how the code evolved from the original and see some of the edge cases caught by testing, such as missing text/event-stream headers for streaming, temperature being ignored by the TRT-LLM BLS, content=None for streaming messages causing genai-perf errors compared to content="", and certain sampling parameters being converted to incorrect types internally.

rmccorm4 added 30 commits July 30, 2024 23:30
… speed up testing, next step add tests for /v1/models routes
…, need to refactor for including server startup
…ing server only once per test class, skip openai client tests
… some placeholder tests for future feature support
…oad logic to track/list all models for now, marked xfail test for known issue with TRT-LLM temperature, added logic to support testing both TRT-LLM and vLLM based on environment, added openai dep to Dockerfile but skipping openai tests for now
…add OpenAIServer utility for testing without FastAPI TestClient, rename folder from example to openai for clarity that the source code isn't an example, add some usage examples with curl, genai-perf, and openai client, add --tokenizer to main.py
…ting it at model load time. Cleanup completions tests
…a decoupled model with an empty final response, add response validation for this scenario
@@ -36,14 +36,34 @@ def _create_vllm_inference_request(
    model, prompt, request: CreateChatCompletionRequest | CreateCompletionRequest
):
    inputs = {}
    excludes = {"model", "stream", "messages", "prompt", "echo"}
    # Exclude non-sampling parameters so they aren't passed to vLLM
rmccorm4 (Contributor Author) commented on the diff above:
NOTE: It may make more sense to explicitly "include" supported sampling parameters, but that list would be of a similar length or longer. Both approaches likely require periodic updates, either to include a new sampling field or to exclude a new non-sampling field.
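
For context, one plausible way the exclude set above could be applied when converting an OpenAI-style request into vLLM sampling parameters (a sketch, not the literal code from this diff):

# Serialize everything except the excluded non-sampling fields; the remaining
# fields (temperature, top_p, max_tokens, ...) become vLLM sampling parameters.
sampling_parameters = request.model_dump_json(exclude=excludes, exclude_none=True)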

GuanLuo previously approved these changes Oct 10, 2024
if num_responses == 2 and responses[-1].final != True:
    raise Exception("Unexpected internal error with incorrect response flags")
if num_responses > 2:
    raise Exception(f"Unexpected number of responses: {num_responses}, expected 1.")
A reviewer (Contributor) commented:
Maybe for later relocation... Why break these functions out into separate files? There are also a bunch of helper functions in triton_engine.py. And if there is a need for multiple files for a particular engine implementation, I would group them together like:

engine
   |----- triton
               |------- __init__.py
               |------- utils.py

rmccorm4 (Contributor Author) replied Oct 10, 2024:

Yeah, agreed - the separation of the utils/helpers definitely got a little blurry/messy; a restructure like this makes sense.

I would like to further restructure a few things later as well.

I think eventually having the imports in a user-facing app look something like this would be more standardized:

from tritonserver.core import Server
from tritonserver.frontends.kserve import KServeGrpc
from tritonserver.frontends.openai import FastApiFrontend
from tritonserver.frontends.openai.engine import TritonLLMEngine

But will consider restructuring in general as a follow-up PR focused just on structure after some discussion.

@chorus-over-flanger commented:

Would it be possible to include an /embeddings endpoint too? It would be the last missing piece for our company to try out Triton for our on-prem RAG/Agentic solutions.

Strong upvote for this feature to be enabled; it would be great and would indeed complete the whole functionality.

kthui previously approved these changes Oct 10, 2024
@rmccorm4 rmccorm4 dismissed stale reviews from kthui and GuanLuo via 49162be October 10, 2024 18:50
stages: [pre-commit]
verbose: true
require_serial: true
# FIXME: Only run on changed files when triggered by GitHub Actions
rmccorm4 (Contributor Author) commented:

FYI @GuanLuo - I'll look into fixing this on GitHub Actions separately

@rmccorm4 rmccorm4 requested review from kthui and GuanLuo October 10, 2024 19:57
@rmccorm4 rmccorm4 merged commit d60aa73 into main Oct 11, 2024
3 checks passed
@rmccorm4 rmccorm4 deleted the rmccormick-openai branch October 11, 2024 23:59
rmccorm4 (Contributor Author) commented:

Hi @chorus-over-flanger @faileon, thanks for expressing interest in the /embeddings route! It's on our radar as another feature to add to the OpenAI frontend support (chat, completions, and models) added in this PR.

Since the route itself doesn't have too many parameters defined in the spec, it may be relatively straightforward to add.

If you have any particular models you'd like to see working as an example for /embeddings, do let us know.

If you have any feedback or interest in contributing to the project, let us know as well.

Thanks!

@copasseron commented:

Hi @rmccorm4,

Thanks a lot for this PR. Huge work, very appreciated.

Do you know if you're planning to add support for Function Calling / Tools (DLIS-7168) soon?

@DimadonDL commented:

Hello,

is it already clear when the beta phase will be over?

Thanks for this very important PR!

chorus-over-flanger commented Dec 2, 2024

Hello @rmccorm4

I've been testing the OpenAI frontend of TIS with vLLM, and I'm impressed with the overall results - excellent work! I'd like to share some findings and observations:

  • TIS OpenAI frontend, make trust_remote_code configurable #7845;

  • Performance:
    When load-testing, I noticed the OpenAI frontend (TIS/vLLM) shows lower TPS compared to the Generate extension of TIS with the vLLM backend. This testing was primarily conducted with quantized 7/8B models. I can provide more detailed test results if needed;

  • Regarding the embeddings endpoint:
    In my specific use case, I found that large-model embeddings don't perform as well as current SOTA embedding models with million-scale parameters, which led me to switch to a different backend. Surprisingly, I noticed that vLLM recently added support for various embedding models (both billion- and million-parameter sizes), whereas previously it only supported larger models;

Would also like to request function calling/tools support in the OpenAI frontend, as well as structured output.

huge huge work!

Labels: PR: feat (A new feature), PR: test (Adding missing tests or correcting existing tests)