Usage based Rate Limiting (Counting from response header values) #4756
xref: envoyproxy/ratelimit#752
can we update the ratelimit filter in Envoy (https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/other_features/global_rate_limiting#per-connection-or-per-http-request-rate-limiting) to regenerate the rate limit actions attached to the route and scope them to responses (using the …
so the problem for the ai-gateway use case with getting the usage from a header is: for streaming endpoints, the content is sent line by line, and Envoy shouldn't buffer the entire body but let it stream through. Since OpenAI's and AWS's streaming chat endpoints send the usage stats at the very end of the stream, a header is not a suitable communication medium for that case. I think it's better to use dynamic_metadata as the communication channel between the producer (in ai-gateway's case, the extproc that analyzes the streaming content) and the rate limit filter, and have the rate limit filter subtract the usage from the budget on stream closure (not in the header phase).
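To illustrate why the usage is only knowable at stream end, here is a minimal sketch (the SSE payload shapes below are hypothetical, modeled loosely on OpenAI's streaming chat format) that scans an event stream and extracts the token usage, which appears only in the final data chunk:

```python
import json

def extract_usage(sse_lines):
    """Scan an OpenAI-style SSE stream for a usage object.

    The usage stats only appear in one of the last chunks, so they
    cannot be known at response-header time -- only at stream closure.
    """
    usage = None
    for line in sse_lines:
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        chunk = json.loads(line[len("data: "):])
        # Most chunks carry content deltas; only the final one carries usage.
        if chunk.get("usage"):
            usage = chunk["usage"]
    return usage

# Hypothetical stream: content deltas first, usage only at the very end.
stream = [
    'data: {"choices":[{"delta":{"content":"Hel"}}],"usage":null}',
    'data: {"choices":[{"delta":{"content":"lo"}}],"usage":null}',
    'data: {"choices":[],"usage":{"prompt_tokens":9,"completion_tokens":2,"total_tokens":11}}',
    'data: [DONE]',
]
print(extract_usage(stream))  # → {'prompt_tokens': 9, 'completion_tokens': 2, 'total_tokens': 11}
```

An extproc analyzing the stream this way would only have a value to report after the last chunk, which is why the subtraction has to happen at stream closure rather than in the header phase.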
thanks for explaining that @mathetake, so we can use this GH issue to track 3 items
sounds good - i will open an issue in envoyproxy/envoy to start the discussion with maintainers
Commit Message: ratelimit: option to execute action on stream done

Additional Description: This adds a new option `apply_on_stream_done` to the rate limit policy corresponding to each descriptor. This allows descriptors to be configured to execute in a response-content-aware way without enforcing the rate limit (in other words, "fire-and-forget"). Since the addend can currently be controlled via metadata per descriptor, another filter can set the value there to reflect its intent, for example a Lua or Ext Proc filter. This use case arises from LLM API services, which usually return usage statistics in the response body. More specifically, they have "streaming" APIs whose response is a line-by-line event stream where the very last line contains the usage statistics. The lazy nature of this action is perfectly fine in these use cases, since the rate limit applies as "you are forbidden from the next time". Beyond the LLM-specific case, I've also encountered this in data center resource allocation, where operators want to "block the computation from the next time, since you used this much resources in this request".

Ref: envoyproxy/gateway#4756
Risk Level: low
Testing: done
Docs Changes: done
Release Notes: TODO
Platform Specific Features: n/a

Signed-off-by: Takeshi Yoneda <[email protected]>
Mirrored from https://github.com/envoyproxy/envoy @ 857107b72abdf62690b7a1c69f9a3684d57f5f3e
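A hedged sketch of how the new option might be wired into a route's rate limit policy (the `apply_on_stream_done` field name comes from the commit message above; the surrounding route layout is illustrative, and the exact proto schema may differ across Envoy versions):

```yaml
# Sketch only: a descriptor executed at stream completion, fire-and-forget.
# The addend is expected to be supplied at runtime via dynamic metadata,
# e.g. set by an ext_proc filter that parsed the streaming usage stats.
routes:
- match: { prefix: "/v1/chat/completions" }
  route:
    cluster: llm_backend          # hypothetical cluster name
    rate_limits:
    - actions:
      - request_headers:
          header_name: x-user-id  # hypothetical per-user descriptor
          descriptor_key: user
      apply_on_stream_done: true  # execute on stream closure, do not enforce
```

With this shape, the descriptor is sent to the rate limit service only after the stream closes, so the budget subtraction reflects the usage reported at the end of the response body rather than anything available in the header phase.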
so the Envoy side PR has landed on main - next is EG