Usage based Rate Limiting (Counting from response header values) #4756

Open · Tracked by #4748
arkodg opened this issue Nov 21, 2024 · 7 comments

Comments

arkodg (Contributor) commented Nov 21, 2024

No description provided.

arkodg (Contributor, Author) commented Nov 21, 2024

cc @zirain @missBerg

zirain (Contributor) commented Nov 29, 2024

xref: envoyproxy/ratelimit#752

arkodg (Contributor, Author) commented Dec 5, 2024

Can we update the ratelimit filter in Envoy (https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/other_features/global_rate_limiting#per-connection-or-per-http-request-rate-limiting) to regenerate the ratelimit actions attached to the route and scope them to responses (using the `stage` field)?
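
For reference, a minimal sketch (not EG's generated config) of how `stage` scoping looks in Envoy today: a ratelimit filter instance only evaluates route rate limits whose `stage` matches its own, so a response-phase evaluation would need either a second filter instance or new response-path semantics. The field names come from Envoy's ratelimit filter API; the stage values and the descriptor are illustrative.

```yaml
http_filters:
- name: envoy.filters.http.ratelimit
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
    domain: eg-ratelimit
    stage: 0              # evaluates only route rate_limits with stage: 0
# A second instance with stage: 1 would pick up the response-scoped actions,
# but both still run at the request phase today, which is the open question here.

route:
  rate_limits:
  - stage: 0              # request-path limit
    actions:
    - request_headers: {header_name: x-user-id, descriptor_key: user}
  - stage: 1              # intended response-path limit
    actions:
    - request_headers: {header_name: x-user-id, descriptor_key: user}
```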

mathetake (Member) commented Dec 5, 2024

The problem with getting usage from a header for the ai-gateway use case is streaming endpoints: the content is sent line by line, and Envoy shouldn't buffer the entire body but let it stream through. Since OpenAI's and AWS's streaming chat endpoints send the usage stats at the very end of the stream, a header is not a suitable communication medium for that case. I think it's better to use dynamic_metadata as the channel between the producer (in ai-gateway's case, the extproc that analyzes the streaming content) and the rate limit filter, and make the rate limit filter subtract the usage from the budget on stream closure (not at the header phase).
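
For context, Envoy's route rate limit actions can already source a descriptor entry from dynamic metadata, which is the producer/consumer channel being described here. A minimal sketch, where the `ai_gateway.llm_usage` namespace and `output_tokens` key are hypothetical names an extproc might write, not anything ai-gateway ships:

```yaml
# The extproc (producer) writes parsed usage into dynamic metadata;
# this action (consumer) turns it into a rate limit descriptor entry.
rate_limits:
- actions:
  - metadata:
      descriptor_key: llm_usage
      metadata_key:
        key: ai_gateway.llm_usage   # hypothetical metadata namespace
        path:
        - key: output_tokens        # hypothetical key set by the extproc
      source: DYNAMIC               # read dynamic (not route) metadata
```

The missing piece is timing: this action fires at the usual request-phase rate limit call, whereas the streaming usage is only known at stream close, which is what the proposal below addresses.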

arkodg (Contributor, Author) commented Dec 5, 2024

Thanks for explaining that @mathetake. We can use this GH issue to track 3 items:

  • reusing the ratelimit filter in the response path in Envoy Proxy
  • supporting rate limiting based on the response in EG
  • exposing metadata as a first-class selector (like header and client IP) in EG (see the sketch below)
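
On the third item, a sketch of where a metadata selector would slot in next to the selectors EG exposes today (`headers` with match types like `Distinct`, and `sourceCIDR`, are existing `BackendTrafficPolicy` fields; the commented `metadata` stanza is hypothetical):

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: usage-based-limit
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: llm-route
  rateLimit:
    type: Global
    global:
      rules:
      - clientSelectors:
        - headers:              # existing first-class selector
          - name: x-user-id
            type: Distinct
          # metadata: {...}     # hypothetical: item 3 would add metadata here
        limit:
          requests: 1000
          unit: Hour
```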

mathetake (Member) commented Dec 5, 2024

Sounds good. I will open an issue in envoyproxy/envoy to start the discussion with the maintainers.

wbpcode pushed a commit to envoyproxy/envoy that referenced this issue Dec 19, 2024
Commit Message: ratelimit: option to execute action on stream done

Additional Description:
This adds a new option `apply_on_stream_done` to the rate limit policy corresponding to each descriptor. It allows descriptors to be executed in a response-content-aware way without enforcing the rate limit (in other words, "fire-and-forget"). Since the addend can currently be controlled via metadata per descriptor, another filter, for example a Lua or Ext Proc filter, can be used to set that value to reflect its intent.

This use case arises from LLM API services, which usually return usage statistics in the response body. More specifically, they have "streaming" APIs whose response is a line-by-line event stream where the very last line contains the usage statistics. The lazy nature of this action is perfectly fine for these use cases, since the rate limit amounts to "you are forbidden from the next time".

Besides the LLM-specific case, I've also encountered this need in data center resource allocation, where operators want to "block the computation from the next time, since you used this much resource in this request".

Ref: envoyproxy/gateway#4756

Risk Level: low
Testing: done
Docs Changes: done
Release Notes: TODO
Platform Specific Features: n/a

---------

Signed-off-by: Takeshi Yoneda <[email protected]>
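
Putting the pieces together, a sketch of how the new option is meant to be wired at the route level: `apply_on_stream_done` is the field this commit adds, and the idea that an extproc sets the addend via dynamic metadata follows the commit message, with the exact metadata key left as an assumption.

```yaml
route:
  rate_limits:
  - apply_on_stream_done: true   # new: report on stream completion, fire-and-forget
    actions:
    - request_headers:
        header_name: x-user-id
        descriptor_key: user
# Separately, an extproc (or Lua) filter parses the final usage line of the
# streamed response and sets the addend for this descriptor via dynamic
# metadata, as described above, so the deduction reflects actual token usage.
```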
update-envoy bot added a commit to envoyproxy/data-plane-api that referenced this issue Dec 19, 2024, with the same commit message as above, mirrored from https://github.com/envoyproxy/envoy @ 857107b72abdf62690b7a1c69f9a3684d57f5f3e
mathetake (Member) commented

So the Envoy-side PR has landed on main; next is EG.
