Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Global ratelimiter part 3: compact request-weighted-algorithm data #172

Merged
merged 3 commits into from
Jun 27, 2024

Conversation

Groxx
Copy link
Contributor

@Groxx Groxx commented Jun 6, 2024

Mostly-prerequisite for the final major step of building the global ratelimiter system in cadence-workflow/cadence#6141

This Thrift addition does not need to be done, the system could instead exchange Protobuf / gob / JSON data. But I've done it in Thrift because:

  1. We already use Thrift rather heavily in service code, for long-term-stable data, like many database types.
  2. We do not use Protobuf like ^ this anywhere. This PR could begin to change that, but I feel like that has some larger ramifications to discuss before leaping for it.
  3. Gob is significantly larger than Thrift, and no more human-readable than Thrift or Protobuf, and it doesn't offer quite as strong protection against unintended changes (IDL files/codegen make that "must be stable" contract very explicit).

Notes otherwise include:

  • i32 because more than 2 million operations within an update cycle (~3s planned) on a single host is roughly 1,000x beyond the size of ALL of our current ratelimits, and it uses half of the space of an i64.
    • To avoid roll-around issues even if this happens, the service code saturates at max-i32 rather than rolling around. We'll just lose precise weight information across beyond-2m hosts if that happens.
  • double is returned because it's scale-agnostic and not particularly worth squeezing further, and it allows the aggregator to be completely target-RPS-agnostic (it doesn't need limiters or that dynamic config at all as it just tracks weight).
    • This could be adjusted to a... pair of ints? Local/global RPS used, so callers can determine their weight locally? I'm not sure if that'd be clearer or more useful, but it's an option, especially since I don't think we care about accurately tracking <1RPS (so ints are fine).
  • If we decide we care a lot about data size, key strings are by far the majority of the bytes. There are a lot of key-compaction options (most simply: a map per collection name), we could experiment a bit.

And last but not least, if we change our mind and want to move away from Thrift here:
we just need to make a new any.ValueType string to identify that new format, and maintain this thrift impl for as long as we want to allow transparent server upgrades. And when we remove it, if someone still hasn't upgraded yet, they'll just fall back to local-only behavior (which is what we have used for the past several years) until the deploy finishes. Risk is extremely low.

@Groxx Groxx marked this pull request as ready for review June 24, 2024 22:23
@Groxx Groxx changed the title add ratelimit request types Global ratelimiter part 3: compact request-weighted-algorithm request data Jun 24, 2024
@Groxx Groxx changed the title Global ratelimiter part 3: compact request-weighted-algorithm request data Global ratelimiter part 3: compact request-weighted-algorithm data Jun 24, 2024
@Groxx Groxx merged commit 12f43fe into cadence-workflow:master Jun 27, 2024
5 checks passed
@Groxx Groxx deleted the ratelimit-type branch June 27, 2024 20:46
Groxx added a commit to cadence-workflow/cadence that referenced this pull request Jun 28, 2024
After too many attempts to break this apart and build different portions in self-contained ways, and running into various inter-dependent roadblocks... I just gave up and did it all at once.

# Rollout plan for people who don't want or need this system

Do nothing :)

As of this PR, you'll use "disabled" and that should be as close to "no changes at all" as possible.
Soon, you'll get "local", and then you'll have some new metrics you can use (or ignore) but otherwise no behavior changes.

And that'll be it.  The "global" load-balanced stuff is likely to remain opt-in.

# Rollout plan for us

For deployment: any order is fine / should not behave (too) badly.  Even if "global" or either shadow mode is selected on the initial deploy.  Frontends will have background `RatelimitUpdate` request failures until History is deployed, but that'll just mean it continues to use the "local" internal fallback and that's in practice the same behavior as "local" or "disabled", just slightly noisier.

The _smoothest_ deployment is: deploy everything on "disabled" or "local" (the default(s), so no requests are sent until deploy is done), then switch to "local-shadow-global" to warm global limiters / check that it's working, then "global" to use the global behavior.  

Rolling back is just the opposite.  Ideally disable things first to stop the requests, but even if you don't it should be fine.

In more detail:

1. At merge time, this will set the "key mode" (`frontend.globalRatelimiterMode`) to "disabled", which gets as close as is reasonably possible to acting _exactly_ like it did before this PR.
   - This is also effectively the panic button for the initial rollout.
2. Once that proves to not immediately explode, switch to "local" for all keys.  This will keep the current ratelimiter rates, but will start collecting and emitting ratelimiter-usage metrics, so we can make sure that doesn't explode either (and update dashboards, etc).
   - "local" will eventually become the new default and I'll remove "disabled" as it's the same behavior but I think we'll want to keep the metrics.
3. Probably switch everything over to "local-shadow-global" so we start using the global system and emitting its metrics too, so we can make sure it doesn't seem like it'll explode / be surprisingly worse / etc.
   - pprof it to make sure running costs are in expected bounds
5. Start switching individual domains over to `"global"` and lowering their RPS back to where we intend, rather than their current artificially-raised-to-mitigate-load-imbalance values.
   - This is done by making `frontend.globalRatelimiterMode` return "global" for keys like `.*:my-domain` (to catch `user:my-domain`, `worker:my-domain`, etc).
   - In the built-in dynamic configs, this looks like: `constraints: {ratelimitKey: "user:my-domain"}`
6. If all goes well, we'll probably switch everyone over to "global" soonish, and we can retain "local" for edge cases that we didn't expect, where the old behavior works better.

# The changes in a nutshell
(... I guess it's a coconut, given the size)

This PR includes:

- Four separate "collection"s, which match the previous `quotas.Collection` usage (and are used as a drop-in replacement, though this needed a change to use interfaces).
  - This means we have four concurrent update cycles per limiting/frontend host, but they all share aggregating/history collections (which is fine because `shared.GlobalKey`s are namespaced).
- A dynamic config flag to control which "mode" a key is in: `disabled` (old code fallthrough), `local` (old code with metrics), `global`, or `x-shadow-y` to use x while shadowing y.
  - These can be changed at any time and do not need restarts/etc to take effect.  Old data will be cleaned up when changing modes, but the "collection" itself does not actually stop in any mode, it just effectively no-ops as needed.
  - This operates on collection-name-prefixed keys (`shared.GlobalKey`), so in practice we will see things like `user:domain` <- this is the limiter for "user" requests for that domain, e.g. StartWorkflow RPS.  This allows us to roll this out per domain (suffix matches or just compare against all 4 values) and/or per type (prefix matches), so we can adjust to surprises reasonably precisely.
- Switched `quotas`-related code to the new `clock.Ratelimiter` APIs as much as possible, which allows some simple wrappers and sharing more logic with other `quotas` package code.
- Added rather quick Limiter-side garbage collection after realizing some issues with weights going super low, and it also seems like a good idea to keep data usage low in the system in general.
  - This is a semantic change over previous behavior, but seems important to have in v1.
- A couple simple thrift types to keep the data I send through this system compact (cadence-workflow/cadence-idl#172)
- A PeerResolver addition to split a slice of strings into the keys-per-host that the associated data should be sent to, and a new type to make it clearer that "this is an RPC peer, not a string"
  - And exposing this a few more places to get it into the RPC package, so it can choose which hosts to contact.
- Several new metrics/logs/dynamicconfig pieces, to monitor and control all this.
- Bundled the RequestWeighted arguments into a struct so it's easier to keep encoding and decoding together, and pass it blindly between the two pieces of code.
  - Initially I wanted to keep all RPC-type details internal to the `rpc` package, and that drives some of this setup, but I'm pretty sure that doesn't make sense with a full plugin-friendly system.  So this will almost certainly be moved later.

# Testing

Aside from the unit tests here, I've locally run all this with the new `development_instance2.yaml` file, made some domains / sent some requests, watched where requests went / how weights changed / when GC occurred / etc.  After some bug fixes and the "GC locally after 5 idle periods" change, it seems to be doing exactly what I want it to do, including adjusting as I start and stop the extra instance(s).

I would *like* to build a multi-instance cluster test (or a docker-compose.yaml at the very least) for a variety of kinds of tests, but I wasn't able to find anything that looked promising to build off, and I didn't want to spend a week figuring one out from scratch :\  I'm open to trying if someone has concrete ideas though.

# Future changes, roughly in priority order

- High-level docs are not yet updated.
  - This should be done before a release / encouraging its use publicly.
- Currently "insufficient data" and "low total rps usage" are not handled well in this system.
  - "Insufficient data" almost certainly deserves to be handled, otherwise after a ring change the first host to call `RatelimitUpdate` for a migrated key will receive _all_ the weight, which is both unfair and may allow exceeding the target RPS.  Having aggregators not return data until [update interval] or similar has passed since the first update may be enough to resolve this.
  - "Low RPS" currently has some surprising edge cases like very very low weights (if more zero periods than used periods) and being less than ideal when a burst of requests occurs.  Low weights seems important to resolve and may involve just preventing average RPS from dipping below 1 (or similar), and bursts could be improved by allowing hosts to use some of the "free" RPS until the next update (but we are not yet sure if we want to allow this).
- "disabled" mode is basically a temporary safety fallback, and it should be removed. 
  - "local" has better monitoring and does garbage-collection and is probably preferable in ~all cases.
- I am not confident that these metrics/logs will give us all the observability we want, so I anticipate some changes / additions / etc.
  - Currently we have _no_ metrics "directly" on limiters, so all existing "request was ratelimited" data is based on externally-visible behavior and will not change at all.  So this PR should strictly be no _worse_ than our existing monitoring, but I do not really think it is _good enough_ yet.
- There are a couple changes I'd like to make to third-party libraries:
  - `golang.org/x/time/rate` needs a PR for its flawed locking.
    - If/when that is accepted, `clock.Ratelimiter` _likely should not change at all_.  `x/time/rate` will likely still allow time-rewinding, and we'll still need to wrap it and control `reservation.CancelAt` calls / for mocking, and that'll need essentially everything that's currently there.
  - `github.com/jonboulle/clockwork` is tough to use with time.Tickers and contexts, and that seems fix-able.
    - Adding a `ticker.OneShotChan()` API would let us know when a "receive tick -> do something -> go back to waiting" cycle completes, rather than having to sleep and hope it's long enough.  Currently we have no real way to work around this.
    - `clock.WithTimeout(ctx, dur)` and similar seems rather obviously needed in retrospect, LOTS of time-based stuff uses context timeouts.  I have a prototype built but I'm not confident that it's "good enough" to serve as a precise replacement, and we'd need to do something to ensure prod costs are either low enough to accept, or start using build tags to exclude it from prod entirely.
- Adding a custom `membership.Peer` arg to `history/client.RatelimitUpdate` seems ideal, and is hopefully not too difficult.
- This code is not fully plug-and-play capable right now.  To allow internally-implemented algorithms / multiple algorithms / etc to be added and dynamically selectable will need some medium-small-ish more work to come up with those general structures, and a dynamic config structure to control it.
  - This will almost certainly happen, unless we somehow decide this is perfect in v1.
  - At a very high level, this is just "keep a list of registered algorithms and collections, dispatch by the Any-data's `ValueType`", and some changes to the `rpc` package to make it generic.
  - "local" should be extracted more completely from the "global"-capable system before doing this, but I suspect that'll happen pretty naturally as part of making this more plug-and-play.  "local" and "global" are just two algorithms, one which doesn't use RPC.
timl3136 pushed a commit to timl3136/cadence-idl that referenced this pull request Jul 16, 2024
…adence-workflow#172)

Mostly-prerequisite for the final major step of building the global ratelimiter system in cadence-workflow/cadence#6141

This Thrift addition _does not_ need to be done, the system could instead exchange Protobuf / gob / JSON data.  But I've done it in Thrift because:
1. We already use Thrift rather heavily in service code, for long-term-stable data, like many database types.
2. We do not use Protobuf like ^ this _anywhere_.  This PR could begin to change that, but I feel like that has some larger ramifications to discuss before leaping for it.
3. Gob is _significantly_ larger than Thrift, and no more human-readable than Thrift or Protobuf, and it doesn't offer quite as strong protection against unintended changes (IDL files/codegen make that "must be stable" contract very explicit).

Notes otherwise include:
- i32 because more than 2 million operations within an update cycle (~3s planned) on a single host is roughly 1,000x beyond the size of ALL of our current ratelimits, and it uses half of the space of an i64.
  - To avoid roll-around issues even if this happens, the service code saturates at max-i32 rather than rolling around.  We'll just lose precise weight information across beyond-2m hosts if that happens.
- `double` is returned because it's scale-agnostic and not particularly worth squeezing further, and it allows the aggregator to be completely target-RPS-agnostic (it doesn't need limiters or that dynamic config _at all_ as it just tracks weight).
  - This could be adjusted to a... pair of ints?  Local/global RPS used, so callers can determine their weight locally?  I'm not sure if that'd be clearer or more useful, but it's an option, especially since I don't think we care about accurately tracking <1RPS (so ints are fine).
- If we decide we care a lot about data size, key strings are by far the majority of the bytes.  There are a lot of key-compaction options (most simply: a map per collection name), we could experiment a bit.

And last but not least, if we change our mind and want to move away from Thrift here:
we just need to make a new `any.ValueType` string to identify that new format, and maintain this thrift impl for as long as we want to allow transparent server upgrades.  And when we remove it, if someone still hasn't upgraded yet, they'll just fall back to local-only behavior (which is what we have used for the past several years) until the deploy finishes.  Risk is extremely low.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants