Usage based Rate Limiting (Counting from response header values) #4756
xref: envoyproxy/ratelimit#752
can we update the ratelimit filter in Envoy (https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/other_features/global_rate_limiting#per-connection-or-per-http-request-rate-limiting) to regenerate the rate limit actions attached to the route and scope them to responses (using the …
so the problem for the ai-gateway use case with getting the usage from a header is: for streaming endpoints, the content is sent line by line, and Envoy shouldn't buffer the entire body but let it stream through. Since OpenAI's and AWS's streaming chat endpoints send the usage stats at the very end of the stream, a header is not a suitable communication medium for that case. I think it's better to use dynamic_metadata as the communication channel between the producer (in ai-gateway's case, the extproc that analyzes the streaming content) and the rate limit filter, and have the rate limit filter subtract the usage from the budget on stream closure (not in the header phase).
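To illustrate why the usage is only knowable at stream end, here is a minimal sketch (the SSE payload shapes below are hypothetical, modeled loosely on OpenAI's streaming chat format) that scans an event stream and extracts the token usage, which appears only in the final data chunk:

```python
import json

def extract_usage(sse_lines):
    """Scan an OpenAI-style SSE stream for a usage object.

    The usage stats only appear in one of the last chunks, so they
    cannot be known at response-header time -- only at stream closure.
    """
    usage = None
    for line in sse_lines:
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        chunk = json.loads(line[len("data: "):])
        # Most chunks carry content deltas; only the final one carries usage.
        if chunk.get("usage"):
            usage = chunk["usage"]
    return usage

# Hypothetical stream: content deltas first, usage only at the very end.
stream = [
    'data: {"choices":[{"delta":{"content":"Hel"}}],"usage":null}',
    'data: {"choices":[{"delta":{"content":"lo"}}],"usage":null}',
    'data: {"choices":[],"usage":{"prompt_tokens":9,"completion_tokens":2,"total_tokens":11}}',
    'data: [DONE]',
]
print(extract_usage(stream))  # → {'prompt_tokens': 9, 'completion_tokens': 2, 'total_tokens': 11}
```

An extproc analyzing the stream this way would only have a value to report after the last chunk, which is why the subtraction has to happen at stream closure rather than in the header phase.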
thanks for explaining that @mathetake, so we can use this GH issue to track 3 items
sounds good - i will open an issue in envoyproxy/envoy to start the discussion with maintainers
Commit Message: ratelimit: option to execute action on stream done

Additional Description: This adds a new option `apply_on_stream_done` to the rate limit policy corresponding to each descriptor. This allows descriptors to be configured to execute in a response-content-aware way without enforcing the rate limit (in other words, "fire-and-forget"). Since the addend can currently be controlled via metadata per descriptor, another filter can set the value there to reflect its intent, for example a Lua or Ext Proc filter. This use case arises from LLM API services, which usually return usage statistics in the response body. More specifically, they have "streaming" APIs whose response is a line-by-line event stream where the very last line contains the usage statistics. The lazy nature of this action is perfectly fine in these use cases, since the rate limit applies as "you are forbidden from the next time". Beyond the LLM-specific case, I've also encountered this in data center resource allocation, where operators want to "block the computation from the next time, since you used this much resources in this request".

Ref: envoyproxy/gateway#4756
Risk Level: low
Testing: done
Docs Changes: done
Release Notes: TODO
Platform Specific Features: n/a

Signed-off-by: Takeshi Yoneda <[email protected]>
Mirrored from https://github.com/envoyproxy/envoy @ 857107b72abdf62690b7a1c69f9a3684d57f5f3e
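A hedged sketch of how the new option might be wired into a route's rate limit policy (the `apply_on_stream_done` field name comes from the commit message above; the surrounding route layout is illustrative, and the exact proto schema may differ across Envoy versions):

```yaml
# Sketch only: a descriptor executed at stream completion, fire-and-forget.
# The addend is expected to be supplied at runtime via dynamic metadata,
# e.g. set by an ext_proc filter that parsed the streaming usage stats.
routes:
- match: { prefix: "/v1/chat/completions" }
  route:
    cluster: llm_backend          # hypothetical cluster name
    rate_limits:
    - actions:
      - request_headers:
          header_name: x-user-id  # hypothetical per-user descriptor
          descriptor_key: user
      apply_on_stream_done: true  # execute on stream closure, do not enforce
```

With this shape, the descriptor is sent to the rate limit service only after the stream closes, so the budget subtraction reflects the usage reported at the end of the response body rather than anything available in the header phase.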
so the Envoy side PR has landed on main - next is EG