http ratelimit: option to reduce budget on stream done #37548

Merged: 30 commits into main on Dec 19, 2024

Latest commit: apply review commments (b76e3d7)
CI (Envoy) / Envoy/Publish and verify succeeded Dec 18, 2024 in 46m 45s

Check run finished (success ✔️): Envoy/Publish and verify (pr/37548/main@b76e3d7)

Check started by: Request (pr/37548/main@b76e3d7)

mathetake (@mathetake), commit b76e3d7, #37548, merge main@9d14e5e

http ratelimit: option to reduce budget on stream done

Commit Message: ratelimit: option to execute action on stream done

Additional Description:
This adds a new option, apply_on_stream_done, to the rate limit
policy corresponding to each descriptor. It allows descriptors to be
evaluated in a response-content-aware way without enforcing the rate
limit on the current request (in other words, "fire-and-forget"). Since
the addend can currently be controlled via the envoy.ratelimit.hits_addend
metadata, another filter, for example a Lua or Ext Proc filter, can set
that value to reflect its intent (see the sketches below).
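
A minimal sketch of what this could look like in a route configuration.
apply_on_stream_done is the field this PR adds; the virtual host, route,
cluster, header, and descriptor key are made up for illustration:

```yaml
route_config:
  name: llm_routes
  virtual_hosts:
  - name: llm_api
    domains: ["*"]
    routes:
    - match:
        prefix: "/v1/chat"
      route:
        cluster: llm_backend
        rate_limits:
        - actions:
          - request_headers:
              header_name: "x-user-id"
              descriptor_key: "user"
          # New in this PR: evaluate this descriptor when the stream
          # completes, instead of enforcing it on the incoming request.
          apply_on_stream_done: true
```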

This use case arises from LLM API services, which usually return
usage statistics in the response body. More specifically, they offer
"streaming" APIs whose response is a line-by-line event stream in which
the very last line contains the usage statistics. The lazy nature of
this action is perfectly fine for these use cases, as the rate limit
effectively works as "you are blocked starting from your next request"
(a Lua sketch of wiring this up follows).

Beyond the LLM-specific scenario, I've also encountered this use case
in data center resource allocation, where operators want to "block the
computation from the next time, since you used this much resource in
this request".

Ref: envoyproxy/gateway#4756

Risk Level: low
Testing: done
Docs Changes: done
Release Notes: TODO
Platform Specific Features: n/a

Environment

Request variables

Key Value
ref 5aa40f0
sha b76e3d7
pr 37548
base-sha 9d14e5e
actor mathetake @mathetake
message http ratelimit: option to reduce budget on stream done ...
started 1734519597.120959
target-branch main
trusted false
Build image

Container image(s) (as used in this CI run)

Key Value
default envoyproxy/envoy-build-ubuntu:d2be0c198feda0c607fa33209da01bf737ef373f
mobile envoyproxy/envoy-build-ubuntu:mobile-d2be0c198feda0c607fa33209da01bf737ef373f
Version

Envoy version (as used in this CI run)

Key Value
major 1
minor 33
patch 0
dev true