
Shared, global RLQS client & buckets cache #34009

Open · wants to merge 12 commits into main

Conversation

@bsurber (Contributor) commented May 7, 2024

Commit Message:
Currently the RLQS client & bucket cache used by the rate_limit_quota filter are per-thread. This causes each client to only have visibility into a small slice of the total traffic seen by the Envoy instance, and multiplicatively increases the number of concurrent, managed streams to the RLQS backend.

This PR will merge the bucket caches to a single, shared map that is thread-safe to access and shared via TLS. Unsafe operations (namely creation of a new index in the bucket cache & setting of quota assignments from RLQS responses) are done by the main thread against a single source-of-truth, then pushed out to worker threads (again via pointer swap + TLS).
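The pointer-swap + TLS pattern described above can be sketched as follows. This is an illustrative, simplified stand-in (not the PR's actual code): workers read an immutable snapshot of the bucket cache through an atomically loaded shared_ptr, while only the main thread copies the source-of-truth, mutates it, and publishes the new snapshot. The names `BucketState`, `addBucket`, and `hasBucket` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>

// Simplified sketch of a copy-on-write bucket cache shared across threads.
struct BucketState {
  uint64_t num_requests_allowed = 0;
};
using BucketCache = std::map<std::string, BucketState>;

// Shared snapshot pointer; in Envoy this would live behind TLS.
static std::shared_ptr<const BucketCache> g_cache =
    std::make_shared<const BucketCache>();

// Main thread only: copy the source-of-truth, add the new index, swap.
void addBucket(const std::string& bucket_id) {
  auto updated = std::make_shared<BucketCache>(*std::atomic_load(&g_cache));
  (*updated)[bucket_id] = BucketState{};
  std::atomic_store(&g_cache,
                    std::shared_ptr<const BucketCache>(std::move(updated)));
}

// Worker threads: lock-free read of the current immutable snapshot.
bool hasBucket(const std::string& bucket_id) {
  return std::atomic_load(&g_cache)->count(bucket_id) > 0;
}
```

Because workers only ever dereference an immutable snapshot, reads need no locks; the cost of mutation (a full map copy) is paid only on the main thread.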

Local threads will also no longer have access to their own RLQS clients + streams. Instead, management of a single, shared RLQS stream will be done on the main thread, by a global client object. That global client object will handle the asynchronous generation & sending of RLQS UsageReports, as well as the processing of incoming RLQS Responses into actionable quota assignments for the filter worker-threads to pull from the buckets cache.
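The "unsafe operations run only on the main thread" design above can be sketched with a dispatcher stand-in. This is a hedged, hypothetical sketch (`MainDispatcher` and `GlobalRlqsClient` are illustrative names, not Envoy APIs): workers never touch the stream or source-of-truth directly; they post closures that the main thread later executes.

```cpp
#include <deque>
#include <functional>
#include <string>
#include <vector>
#include <cassert>

// Minimal stand-in for a main-thread event dispatcher.
class MainDispatcher {
 public:
  void post(std::function<void()> cb) { queue_.push_back(std::move(cb)); }
  // Run pending callbacks; in Envoy this would be the main-thread event loop.
  void drain() {
    while (!queue_.empty()) {
      auto cb = std::move(queue_.front());
      queue_.pop_front();
      cb();
    }
  }
 private:
  std::deque<std::function<void()>> queue_;
};

// Hypothetical global client: owns the single source-of-truth, which is
// only ever mutated on the main thread.
class GlobalRlqsClient {
 public:
  explicit GlobalRlqsClient(MainDispatcher& main) : main_(main) {}
  // Safe to call from any worker thread: defers the write to the main thread.
  void createBucketAsync(const std::string& bucket_id) {
    main_.post([this, bucket_id] { buckets_.push_back(bucket_id); });
  }
  const std::vector<std::string>& buckets() const { return buckets_; }
 private:
  MainDispatcher& main_;
  std::vector<std::string> buckets_;  // Source-of-truth, main thread only.
};
```

The key property is that the write is not visible until the main thread runs the posted callback, so no locking is needed on the source-of-truth itself.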

Additional Description:
The biggest TODO after submission will be supporting the reporting_interval field & handling reporting on different timers if buckets are configured with different intervals.

Risk Level: Medium

Testing:

  • New unit testing of both global & local client objects
  • New unit testing of filter logic
  • Updates to existing config unit testing
  • New integration testing for all of the moving parts.

@bsurber requested a review from @yanavlasov as a code owner May 7, 2024 17:39

Hi @bsurber, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

🐱

Caused by: #34009 was opened by bsurber.


@phlax (Member) commented May 7, 2024

@bsurber could you resolve the merge conflict please - I think that is what is preventing CI from working.

@adisuissa (Contributor)

/assign @tyxia

@yanavlasov (Contributor)

@bsurber please fix the code format. You can run `bazel run //tools/code_format:check_format -- fix`, or apply this diff: https://dev.azure.com/cncf/envoy/_build/results?buildId=169874&view=artifacts&pathAsName=false&type=publishedArtifacts

/wait

@tyxia (Member) left a comment


Thank you for working on this! Nice work

We have been discussing this for a while. Let me add some context here:
The current model is thread-local: the RLQS client, quota cache, etc. are per-thread.
The new model introduced here is global: the RLQS client, quota cache, etc. are per Envoy instance and shared across threads.

The motivation behind the global model is consistency (from the RLQS server's perspective in particular), but it potentially trades off consistency against contention; in particular, we should be careful about the high-QPS, multi-threaded case.

It would be great to perform a load test before the PR is merged. We can kick off the code review in the meantime, though.

@bsurber (Contributor, Author) commented May 10, 2024

Of note, the added load largely won't be on the worker threads, as they only ever touch shared resources to read a pointer from the thread-local cache, increment atomics, and potentially query a shared token bucket (but that's the same in the per-worker-thread model). The only new contention is that added by a) the atomics (so minimal), and b) thread-local storage.

Instead, my main concern to test is the added load on the main thread, which has to perform write operations against the cache + source-of-truth when the cache is first initialized for each bucket, when sending RLQS usage reports, and when processing RLQS responses into quota assignments and writing them into the source-of-truth + cache.
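The "workers only increment atomics" hot path described here can be illustrated with a small sketch. This is hypothetical code, not the PR's actual types (`BucketUsage`, `recordRequest`, and `takeReport` are made-up names): workers do one relaxed atomic increment per request, and at report time the main thread snapshots and resets the counters in one step with `exchange()` so no usage is double-counted across reports.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <utility>

// Per-bucket usage counters shared between workers and the main thread.
struct BucketUsage {
  std::atomic<uint64_t> num_requests_allowed{0};
  std::atomic<uint64_t> num_requests_denied{0};
};

// Worker-thread hot path: one relaxed atomic increment per request.
void recordRequest(BucketUsage& usage, bool allowed) {
  (allowed ? usage.num_requests_allowed : usage.num_requests_denied)
      .fetch_add(1, std::memory_order_relaxed);
}

// Main thread, on the reporting timer: read-and-reset atomically so each
// request is counted in exactly one usage report.
std::pair<uint64_t, uint64_t> takeReport(BucketUsage& usage) {
  return {usage.num_requests_allowed.exchange(0),
          usage.num_requests_denied.exchange(0)};
}
```

Contention on this path is limited to cache-line traffic on the two atomics, which matches the comment that the worker-side cost is minimal.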

@ravenblackx (Contributor)

Looks like this needs more test coverage, and also a merge.
/wait

@bsurber (Contributor, Author) commented May 16, 2024

Ah, still slightly off the coverage limit there. (Edit: Actually, quite far off, I need to remove some defensive coding to follow Envoy style standards).

@jmarantz (Contributor)

/wait (for CI)

@adisuissa (Contributor)

Just a drive-by comment: this is a huge PR. Would it be possible to break it down into smaller PRs that can be better reviewed?
One high-level thing is that there seems to be a large refactor happening in this PR. Maybe it's possible to start with a PR that just does the refactoring (no change to the current behavior), and gradually add PR(s) that modify/extend the functionality.

@alyssawilk (Contributor)

@tyxia PTAL?

@tyxia (Member) commented Jun 13, 2024

@bsurber what is the current strategy/status of the load test (which I think is the determining factor for this PR)?

Let's sync internally on this.

@tyxia (Member) commented Jun 13, 2024

/wait-any

@bsurber (Contributor, Author) commented Jun 25, 2024

> Just a drive-by comment: this is a huge PR. Will it be possible to break it down to smaller PRs that can better reviewed? Ont high-level thing is that there seems to be a large refactor happening in this PR. Maybe it's possible to start with a PR that just does the refactoring (no change to the current behavior), and gradually add PR(s) that modify/extend the functionality.

I did aim to start with a smaller refactor, but any intermediate state left the code progressively dirtier. This was mostly because the fundamental quota bucket structure had to be changed, and the existing client class structures do not fit cleanly into a shared-data + worker-data design.
So rather than create a lot of intentionally confusing intermediate code by trying to reuse the existing structures, I scrapped the majority of what was there and started fresh.

@tyxia (Member) commented Jul 1, 2024

/wait-any

I think this is waiting for our internal load testing.


This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions bot added the "stale" label (stalebot believes this issue/PR has not been touched recently) on Jul 31, 2024
@bsurber (Contributor, Author) commented Aug 1, 2024

This is falling out of sync as other work is prioritized, but it will be caught up Soon™.

github-actions bot commented Sep 5, 2024

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions bot added the "stale" label on Sep 5, 2024
@tyxia added the "no stalebot" label (disables stalebot from closing an issue) and removed the "stale" label on Sep 5, 2024
…le filter worker threads, and the client interface that the worker threads can call to for unsafe operations.

Signed-off-by: Brian Surber <[email protected]>

Create a separated local client & global client. The local client implements RateLimitClient for the local worker thread to call. The global client object performs all the thread-unsafe operations against the source-of-truth (safely, by only running them on the main thread) & pushes the results to TLS caches for the local clients to read.

Signed-off-by: Brian Surber <[email protected]>

Update filter logic to read from the local bucket cache & call to the worker thread's local rl client when write ops are needed (which get passed up to the global client)

Signed-off-by: Brian Surber <[email protected]>

Init functions & build dependencies updated to set up the newly required resources

Signed-off-by: Brian Surber <[email protected]>

Update unit testing to include testing of both client types & local filter logic, and run through full integration testing.

Signed-off-by: Brian Surber <[email protected]>

Implement action assignment expiration & fallback behaviors and abandon_action handling

Signed-off-by: Brian Surber <[email protected]>

Sync integration test changes for exercising stream restarts from commit cea046f

Signed-off-by: Brian Surber <[email protected]>
@bsurber (Contributor, Author) commented Sep 18, 2024

The branch has been synced and all missing features implemented, namely action expiration & fallback, and abandon-action processing.

@bsurber (Contributor, Author) commented Sep 18, 2024

/retest

@bsurber (Contributor, Author) commented Sep 24, 2024

/retest

@kyessenov (Contributor)

PR review reminder @yanavlasov

@kyessenov (Contributor)

/wait

@tyxia (Member) commented Oct 4, 2024

Just FYI, I am reviewing this PR, but it is a fairly large change that will take some time.

We have discussed and agreed internally on the high-level direction of this PR as a potential option. The integration test (like UG verification) and a load test will be good signals to have for merging.

… CacheBuckets when going from a token_bucket assignment to a blanket rule

Signed-off-by: Brian Surber <[email protected]>
bsurber and others added 3 commits October 8, 2024 16:18
…cket when it is first hit

Signed-off-by: Brian Surber <[email protected]>
Update Global RLQS client to send immediate reports for new buckets
@bsurber (Contributor, Author) commented Oct 15, 2024

Updated to conform to RLQS specs by having the global client send an immediate usage report when each bucket is hit for the first time, notifying the backend to send any assignments for that bucket that may be relevant before the next usage-reporting cycle (e.g. if the reporting interval is on the scale of minutes).
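The first-hit behavior described above boils down to "report each new bucket exactly once, immediately". A minimal, hypothetical sketch (the `FirstHitTracker` name and API are illustrative, not from the PR): the global client tracks which bucket IDs it has already reported, and the first hit on a new bucket triggers an immediate report rather than waiting for the next reporting cycle.

```cpp
#include <cassert>
#include <set>
#include <string>

// Illustrative helper: decides whether a bucket hit should trigger an
// immediate usage report to the RLQS backend.
class FirstHitTracker {
 public:
  // Returns true exactly once per bucket ID: on its first hit.
  // std::set::insert reports whether the ID was newly inserted.
  bool shouldSendImmediateReport(const std::string& bucket_id) {
    return reported_.insert(bucket_id).second;
  }
 private:
  std::set<std::string> reported_;
};
```

Subsequent hits on the same bucket fall through to the normal periodic reporting cycle.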

@tyxia (Member) commented Oct 21, 2024

/wait

Waiting for internal tests

Labels: no stalebot (disables stalebot from closing an issue), waiting

10 participants