
Shared, global RLQS client & buckets cache #34009

Open · wants to merge 12 commits into main

Conversation

@bsurber (Contributor) commented May 7, 2024

Commit Message:
Currently the RLQS client & bucket cache used by the rate_limit_quota filter are per-thread. This causes each client to only have visibility into a small slice of the total traffic seen by the Envoy instance, and multiplicatively increases the number of concurrent, managed streams to the RLQS backend.

This PR will merge the bucket caches to a single, shared map that is thread-safe to access and shared via TLS. Unsafe operations (namely creation of a new index in the bucket cache & setting of quota assignments from RLQS responses) are done by the main thread against a single source-of-truth, then pushed out to worker threads (again via pointer swap + TLS).
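The pointer-swap + TLS pattern described above can be sketched as follows. This is an illustrative, simplified stand-in (not the PR's actual code): workers read an immutable snapshot of the bucket cache through an atomically loaded shared_ptr, while only the main thread copies the source-of-truth, mutates it, and publishes the new snapshot. The names `BucketState`, `addBucket`, and `hasBucket` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>

// Simplified sketch of a copy-on-write bucket cache shared across threads.
struct BucketState {
  uint64_t num_requests_allowed = 0;
};
using BucketCache = std::map<std::string, BucketState>;

// Shared snapshot pointer; in Envoy this would live behind TLS.
static std::shared_ptr<const BucketCache> g_cache =
    std::make_shared<const BucketCache>();

// Main thread only: copy the source-of-truth, add the new index, swap.
void addBucket(const std::string& bucket_id) {
  auto updated = std::make_shared<BucketCache>(*std::atomic_load(&g_cache));
  (*updated)[bucket_id] = BucketState{};
  std::atomic_store(&g_cache,
                    std::shared_ptr<const BucketCache>(std::move(updated)));
}

// Worker threads: lock-free read of the current immutable snapshot.
bool hasBucket(const std::string& bucket_id) {
  return std::atomic_load(&g_cache)->count(bucket_id) > 0;
}
```

Because workers only ever dereference an immutable snapshot, reads need no locks; the cost of mutation (a full map copy) is paid only on the main thread.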

Local threads will also no longer have access to their own RLQS clients + streams. Instead, management of a single, shared RLQS stream will be done on the main thread, by a global client object. That global client object will handle the asynchronous generation & sending of RLQS UsageReports, as well as the processing of incoming RLQS Responses into actionable quota assignments for the filter worker-threads to pull from the buckets cache.
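The "unsafe operations run only on the main thread" design above can be sketched with a dispatcher stand-in. This is a hedged, hypothetical sketch (`MainDispatcher` and `GlobalRlqsClient` are illustrative names, not Envoy APIs): workers never touch the stream or source-of-truth directly; they post closures that the main thread later executes.

```cpp
#include <deque>
#include <functional>
#include <string>
#include <vector>
#include <cassert>

// Minimal stand-in for a main-thread event dispatcher.
class MainDispatcher {
 public:
  void post(std::function<void()> cb) { queue_.push_back(std::move(cb)); }
  // Run pending callbacks; in Envoy this would be the main-thread event loop.
  void drain() {
    while (!queue_.empty()) {
      auto cb = std::move(queue_.front());
      queue_.pop_front();
      cb();
    }
  }
 private:
  std::deque<std::function<void()>> queue_;
};

// Hypothetical global client: owns the single source-of-truth, which is
// only ever mutated on the main thread.
class GlobalRlqsClient {
 public:
  explicit GlobalRlqsClient(MainDispatcher& main) : main_(main) {}
  // Safe to call from any worker thread: defers the write to the main thread.
  void createBucketAsync(const std::string& bucket_id) {
    main_.post([this, bucket_id] { buckets_.push_back(bucket_id); });
  }
  const std::vector<std::string>& buckets() const { return buckets_; }
 private:
  MainDispatcher& main_;
  std::vector<std::string> buckets_;  // Source-of-truth, main thread only.
};
```

The key property is that the write is not visible until the main thread runs the posted callback, so no locking is needed on the source-of-truth itself.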

Additional Description:
The biggest TODO after submission will be supporting the reporting_interval field & handling reporting on different timers if buckets are configured with different intervals.

Risk Level: Medium

Testing:

  • New unit testing of both global & local client objects
  • New unit testing of filter logic
  • Updates to existing config unit testing
  • New integration testing for all of the moving parts.

@bsurber requested a review from @yanavlasov as a code owner May 7, 2024 17:39

Hi @bsurber, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

🐱

Caused by: #34009 was opened by bsurber.


@phlax (Member) commented May 7, 2024

@bsurber could you resolve the merge conflict please - I think that is what is preventing CI from working.

@adisuissa (Contributor)

/assign @tyxia

@yanavlasov (Contributor)

@bsurber please fix the code format. You can run `bazel run //tools/code_format:check_format -- fix`, or apply this diff: https://dev.azure.com/cncf/envoy/_build/results?buildId=169874&view=artifacts&pathAsName=false&type=publishedArtifacts

/wait

@tyxia (Member) left a comment


Thank you for working on this! Nice work

We have been discussing this for a while. Let me add some context here:
The current model is thread-local: the RLQS client, quota cache, etc. are per-thread.
The new model introduced here is global: the RLQS client, quota cache, etc. are per Envoy instance and shared across threads.

The motivation behind the global model is consistency (from the RLQS server's perspective in particular), but it potentially trades off consistency against contention; in particular, we should be careful about the high-QPS, multi-threaded case.

It would be great to perform a load test before the PR is merged. We can kick off the code review in the meantime, though.

@bsurber (Contributor, Author) commented May 10, 2024

Of note, the added load largely won't be on the worker threads, as they only ever touch shared resources to read a pointer from the thread-local cache, increment atomics, and potentially query a shared token bucket (but that's the same in the per-worker-thread model). The only new contention is that added by a) the atomics (so minimal), and b) thread-local storage.

Instead, my main concern to test is the added load on the main thread, which has to perform write operations against the cache + source-of-truth when the cache is first initialized for each bucket, when sending RLQS usage reports, and when processing RLQS responses into quota assignments and writing them into the source-of-truth + cache.
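The "workers only increment atomics" hot path described here can be illustrated with a small sketch. This is hypothetical code, not the PR's actual types (`BucketUsage`, `recordRequest`, and `takeReport` are made-up names): workers do one relaxed atomic increment per request, and at report time the main thread snapshots and resets the counters in one step with `exchange()` so no usage is double-counted across reports.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <utility>

// Per-bucket usage counters shared between workers and the main thread.
struct BucketUsage {
  std::atomic<uint64_t> num_requests_allowed{0};
  std::atomic<uint64_t> num_requests_denied{0};
};

// Worker-thread hot path: one relaxed atomic increment per request.
void recordRequest(BucketUsage& usage, bool allowed) {
  (allowed ? usage.num_requests_allowed : usage.num_requests_denied)
      .fetch_add(1, std::memory_order_relaxed);
}

// Main thread, on the reporting timer: read-and-reset atomically so each
// request is counted in exactly one usage report.
std::pair<uint64_t, uint64_t> takeReport(BucketUsage& usage) {
  return {usage.num_requests_allowed.exchange(0),
          usage.num_requests_denied.exchange(0)};
}
```

Contention on this path is limited to cache-line traffic on the two atomics, which matches the comment that the worker-side cost is minimal.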

@ravenblackx (Contributor)

Looks like this needs more test coverage, and also a merge.
/wait

@bsurber (Contributor, Author) commented May 16, 2024

Ah, still slightly off the coverage limit there. (Edit: Actually, quite far off, I need to remove some defensive coding to follow Envoy style standards).

@jmarantz (Contributor)

/wait (for CI)

@adisuissa (Contributor)

Just a drive-by comment: this is a huge PR. Would it be possible to break it down into smaller PRs that can be better reviewed?
One high-level thing is that there seems to be a large refactor happening in this PR. Maybe it's possible to start with a PR that just does the refactoring (no change to the current behavior), and gradually add PR(s) that modify/extend the functionality.

@alyssawilk (Contributor)

@tyxia PTAL?

@tyxia (Member) commented Jun 13, 2024

@bsurber what is the current strategy/status of the load test (which I think is the determining factor for this PR)?

Let's sync internally on this.

@tyxia (Member) commented Jun 13, 2024

/wait-any

@bsurber (Contributor, Author) commented Jun 25, 2024

> Just a drive-by comment: this is a huge PR. Will it be possible to break it down to smaller PRs that can better reviewed? Ont high-level thing is that there seems to be a large refactor happening in this PR. Maybe it's possible to start with a PR that just does the refactoring (no change to the current behavior), and gradually add PR(s) that modify/extend the functionality.

I did aim to start with a smaller refactor, but any intermediate state left the code progressively dirtier. This was mostly because the fundamental quota bucket structure had to be changed, and the existing client class structures do not fit cleanly into a shared-data + worker-data design.
So rather than create a lot of intentionally confusing intermediate code by trying to reuse the existing structures, I scrapped the majority of what was there and started fresh.

@tyxia (Member) commented Jul 1, 2024

/wait-any

I think this is waiting for our internal load testing.


This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions bot added the "stale" label (stalebot believes this issue/PR has not been touched recently) on Jul 31, 2024
@bsurber (Contributor, Author) commented Aug 1, 2024

This is falling out of sync as other work is prioritized, but it will be caught up Soon™.

github-actions bot commented Sep 5, 2024

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions bot added the "stale" label on Sep 5, 2024
@tyxia added the "no stalebot" label (disables stalebot from closing an issue) and removed the "stale" label on Sep 5, 2024
…le filter worker threads, and the client interface that the worker threads can call to for unsafe operations.

Signed-off-by: Brian Surber <[email protected]>

Create a separated local client & global client. The local client implements RateLimitClient for the local worker thread to call. The global client object performs all the thread-unsafe operations against the source-of-truth (safely, by only running them on the main thread) & pushes the results to TLS caches for the local clients to read.

Signed-off-by: Brian Surber <[email protected]>

Update filter logic to read from the local bucket cache & call to the worker thread's local rl client when write ops are needed (which get passed up to the global client)

Signed-off-by: Brian Surber <[email protected]>

Init functions & build dependencies updated to set up the newly required resources

Signed-off-by: Brian Surber <[email protected]>

Update unit testing to include testing of both client types & local filter logic, and run through full integration testing.

Signed-off-by: Brian Surber <[email protected]>

Implement action assignment expiration & fallback behaviors and abandon_action handling

Signed-off-by: Brian Surber <[email protected]>

Sync integration test changes for exercising stream restarts from commit cea046f

Signed-off-by: Brian Surber <[email protected]>
@bsurber (Contributor, Author) commented Sep 18, 2024

The branch has been synced and all missing features implemented, namely action expiration & fallback, and abandon-action processing.

@bsurber (Contributor, Author) commented Sep 18, 2024

/retest

@bsurber (Contributor, Author) commented Sep 24, 2024

/retest

@kyessenov (Contributor)

PR review reminder @yanavlasov

@kyessenov (Contributor)

/wait

@tyxia (Member) commented Oct 4, 2024

Just FYI, I am reviewing this PR, but it is a fairly large change that will take some time.

We have discussed and agreed internally on the high-level direction of this PR as a potential option. The integration test (like UG verification) and a load test will be good signals to have for merging.

… CacheBuckets when going from a token_bucket assignment to a blanket rule

Signed-off-by: Brian Surber <[email protected]>
bsurber and others added 3 commits October 8, 2024 16:18
…cket when it is first hit

Signed-off-by: Brian Surber <[email protected]>
Update Global RLQS client to send immediate reports for new buckets
@bsurber (Contributor, Author) commented Oct 15, 2024

Updated to conform to RLQS specs by having the global client send an immediate usage report when each bucket is hit for the first time, notifying the backend to send any assignments for that bucket that may be relevant before the next usage-reporting cycle (e.g. if the reporting interval is on the scale of minutes).
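The first-hit behavior described above boils down to "report each new bucket exactly once, immediately". A minimal, hypothetical sketch (the `FirstHitTracker` name and API are illustrative, not from the PR): the global client tracks which bucket IDs it has already reported, and the first hit on a new bucket triggers an immediate report rather than waiting for the next reporting cycle.

```cpp
#include <cassert>
#include <set>
#include <string>

// Illustrative helper: decides whether a bucket hit should trigger an
// immediate usage report to the RLQS backend.
class FirstHitTracker {
 public:
  // Returns true exactly once per bucket ID: on its first hit.
  // std::set::insert reports whether the ID was newly inserted.
  bool shouldSendImmediateReport(const std::string& bucket_id) {
    return reported_.insert(bucket_id).second;
  }
 private:
  std::set<std::string> reported_;
};
```

Subsequent hits on the same bucket fall through to the normal periodic reporting cycle.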

@tyxia (Member) commented Oct 21, 2024

/wait

Waiting for internal tests

Labels: no stalebot (disables stalebot from closing an issue), waiting

10 participants