Global ratelimiter: everything else #6141

Merged
merged 29 commits into from
Jun 28, 2024

Conversation

@Groxx Groxx (Contributor) commented Jun 19, 2024

After too many attempts to break this apart and build different portions in self-contained ways, and running into various inter-dependent roadblocks... I just gave up and did it all at once.

Rollout plan for people who don't want or need this system

Do nothing :)

As of this PR, you'll use "disabled" and that should be as close to "no changes at all" as possible.
Soon, you'll get "local", and then you'll have some new metrics you can use (or ignore) but otherwise no behavior changes.

And that'll be it. The "global" load-balanced stuff is likely to remain opt-in.

Rollout plan for us

For deployment: any order is fine / should not behave (too) badly, even if "global" or either shadow mode is selected on the initial deploy. Frontends will have background RatelimitUpdate request failures until History is deployed, but that just means they continue to use the "local" internal fallback, which is in practice the same behavior as "local" or "disabled", just slightly noisier.

The smoothest deployment is: deploy everything on "disabled" or "local" (the default(s), so no requests are sent until deploy is done), then switch to "local-shadow-global" to warm global limiters / check that it's working, then "global" to use the global behavior.

Rolling back is just the opposite. Ideally disable things first to stop the requests, but even if you don't it should be fine.

In more detail:

  1. At merge time, this will set the "key mode" (frontend.globalRatelimiterMode) to "disabled", which gets as close as is reasonably possible to acting exactly like it did before this PR.
    • This is also effectively the panic button for the initial rollout.
  2. Once that proves to not immediately explode, switch to "local" for all keys. This will keep the current ratelimiter rates, but will start collecting and emitting ratelimiter-usage metrics, so we can make sure that doesn't explode either (and update dashboards, etc).
    • "local" will eventually become the new default and I'll remove "disabled" as it's the same behavior but I think we'll want to keep the metrics.
  3. Probably switch everything over to "local-shadow-global" so we start using the global system and emitting its metrics too, so we can make sure it doesn't seem like it'll explode / be surprisingly worse / etc.
    • pprof it to make sure running costs are in expected bounds
  4. Start switching individual domains over to "global" and lowering their RPS back to where we intend, rather than their current artificially-raised-to-mitigate-load-imbalance values.
    • This is done by making frontend.globalRatelimiterMode return "global" for keys like .*:my-domain (to catch user:my-domain, worker:my-domain, etc).
    • In the built-in dynamic configs, this looks like: constraints: {ratelimitKey: "user:my-domain"} (see the sketch after this list).
  5. If all goes well, we'll probably switch everyone over to "global" soonish, and we can retain "local" for edge cases that we didn't expect, where the old behavior works better.
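As a concrete illustration of step 4 (hypothetical, not the real dynamic config plumbing): the mode is resolved per collection-prefixed key, so a suffix match can move one domain to "global" while everything else stays on the cluster-wide default. The key names below come from this PR; the helper itself is only a sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// modeForKey is an illustrative stand-in for a frontend.globalRatelimiterMode
// lookup: the real decision comes from dynamic config constraints (ratelimitKey),
// not from hard-coded string matching like this.
func modeForKey(key, defaultMode string) string {
	if strings.HasSuffix(key, ":my-domain") { // plays the role of a ratelimitKey constraint
		return "global" // catches user:my-domain, worker:my-domain, etc.
	}
	return defaultMode
}

func main() {
	for _, k := range []string{"user:my-domain", "worker:my-domain", "user:other-domain"} {
		fmt.Println(k, "->", modeForKey(k, "local-shadow-global"))
	}
}
```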

The changes in a nutshell

(... I guess it's a coconut, given the size)

This PR includes:

  • Four separate "collection"s, which match the previous quotas.Collection usage (and are used as a drop-in replacement, though this needed a change to use interfaces).
    • This means we have four concurrent update cycles per limiting/frontend host, but they all share aggregating/history collections (which is fine because shared.GlobalKeys are namespaced).
  • A dynamic config flag to control which "mode" a key is in: disabled (old code fallthrough), local (old code with metrics), global, or x-shadow-y to use x while shadowing y.
    • These can be changed at any time and do not need restarts/etc to take effect. Old data will be cleaned up when changing modes, but the "collection" itself does not actually stop in any mode, it just effectively no-ops as needed.
    • This operates on collection-name-prefixed keys (shared.GlobalKey), so in practice we will see keys like user:domain, i.e. the limiter for "user" requests for that domain (e.g. StartWorkflow RPS). This allows us to roll this out per domain (suffix matches, or just compare against all 4 values) and/or per type (prefix matches), so we can adjust to surprises reasonably precisely.
  • Switched quotas-related code to the new clock.Ratelimiter APIs as much as possible, which allows some simple wrappers and sharing more logic with other quotas package code.
  • Added rather quick Limiter-side garbage collection after realizing some issues with weights going super low, and it also seems like a good idea to keep data usage low in the system in general.
    • This is a semantic change over previous behavior, but seems important to have in v1.
  • A couple simple thrift types to keep the data I send through this system compact (Global ratelimiter part 3: compact request-weighted-algorithm data cadence-idl#172)
  • A PeerResolver addition to split a slice of strings into the keys-per-host that the associated data should be sent to, and a new type to make it clearer that "this is an RPC peer, not a string" (roughly sketched after this list).
    • And exposing this a few more places to get it into the RPC package, so it can choose which hosts to contact.
  • Several new metrics/logs/dynamicconfig pieces, to monitor and control all this.
  • Bundled the RequestWeighted arguments into a struct so it's easier to keep encoding and decoding together, and pass it blindly between the two pieces of code.
    • Initially I wanted to keep all RPC-type details internal to the rpc package, and that drives some of this setup, but I'm pretty sure that doesn't make sense with a full plugin-friendly system. So this will almost certainly be moved later.
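To make the PeerResolver item above a bit more concrete, here is a rough sketch (illustrative names only, not the real cadence API) of the "split a slice of keys into keys-per-host" shape, with Peer standing in for the new RPC-peer type:

```go
package main

import "fmt"

// Peer stands in for the new "this is an RPC peer, not a string" type.
type Peer string

// splitKeysByPeer groups global ratelimiter keys by the peer that owns them,
// so one update request can be sent per history host. ownerOf stands in for
// the ring lookup the real resolver would do.
func splitKeysByPeer(keys []string, ownerOf func(key string) (Peer, error)) (map[Peer][]string, error) {
	byPeer := make(map[Peer][]string, len(keys))
	for _, k := range keys {
		p, err := ownerOf(k)
		if err != nil {
			return nil, err
		}
		byPeer[p] = append(byPeer[p], k)
	}
	return byPeer, nil
}

func main() {
	fakeRing := func(key string) (Peer, error) { // trivial stand-in for hashring ownership
		if len(key)%2 == 0 {
			return "10.0.0.1:7934", nil
		}
		return "10.0.0.2:7934", nil
	}
	groups, _ := splitKeysByPeer([]string{"user:a", "worker:a", "user:bb"}, fakeRing)
	fmt.Println(groups)
}
```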

Testing

Aside from the unit tests here, I've locally run all this with the new development_instance2.yaml file, made some domains / sent some requests, watched where requests went / how weights changed / when GC occurred / etc. After some bug fixes and the "GC locally after 5 idle periods" change, it seems to be doing exactly what I want it to do, including adjusting as I start and stop the extra instance(s).

I would like to build a multi-instance cluster test (or a docker-compose.yaml at the very least) for a variety of kinds of tests, but I wasn't able to find anything that looked promising to build off, and I didn't want to spend a week figuring one out from scratch :\ I'm open to trying if someone has concrete ideas though.

Future changes, roughly in priority order

  • High-level docs are not yet updated.
    • This should be done before a release / encouraging its use publicly.
  • Currently "insufficient data" and "low total rps usage" are not handled well in this system.
    • "Insufficient data" almost certainly deserves to be handled, otherwise after a ring change the first host to call RatelimitUpdate for a migrated key will receive all the weight, which is both unfair and may allow exceeding the target RPS. Having aggregators not return data until [update interval] or similar has passed since the first update may be enough to resolve this.
    • "Low RPS" currently has some surprising edge cases like very very low weights (if more zero periods than used periods) and being less than ideal when a burst of requests occurs. Low weights seems important to resolve and may involve just preventing average RPS from dipping below 1 (or similar), and bursts could be improved by allowing hosts to use some of the "free" RPS until the next update (but we are not yet sure if we want to allow this).
  • "disabled" mode is basically a temporary safety fallback, and it should be removed.
    • "local" has better monitoring and does garbage-collection and is probably preferable in ~all cases.
  • I am not confident that these metrics/logs will give us all the observability we want, so I anticipate some changes / additions / etc.
    • Currently we have no metrics "directly" on limiters, so all existing "request was ratelimited" data is based on externally-visible behavior and will not change at all. So this PR should strictly be no worse than our existing monitoring, but I do not really think it is good enough yet.
  • There are a couple changes I'd like to make to third-party libraries:
    • golang.org/x/time/rate needs a PR for its flawed locking.
      • If/when that is accepted, clock.Ratelimiter likely should not change at all. x/time/rate will likely still allow time-rewinding, and we'll still need to wrap it and control reservation.CancelAt calls / for mocking, and that'll need essentially everything that's currently there.
    • github.com/jonboulle/clockwork is tough to use with time.Tickers and contexts, and that seems fix-able.
      • Adding a ticker.OneShotChan() API would let us know when a "receive tick -> do something -> go back to waiting" cycle completes, rather than having to sleep and hope it's long enough. Currently we have no real way to work around this.
      • clock.WithTimeout(ctx, dur) and similar seems rather obviously needed in retrospect, since lots of time-based code uses context timeouts. I have a prototype built, but I'm not confident it's "good enough" to serve as a precise replacement, and we'd need to either ensure prod costs are low enough to accept or start using build tags to exclude it from prod entirely. (A rough shape is sketched after this list.)
  • Adding a custom membership.Peer arg to history/client.RatelimitUpdate seems ideal, and is hopefully not too difficult.
  • This code is not fully plug-and-play capable right now. Allowing internally-implemented algorithms / multiple algorithms / etc to be added and dynamically selected will need a moderate amount of additional work to come up with those general structures, plus a dynamic config structure to control it.
    • This will almost certainly happen, unless we somehow decide this is perfect in v1.
    • At a very high level, this is just "keep a list of registered algorithms and collections, dispatch by the Any-data's ValueType", and some changes to the rpc package to make it generic.
    • "local" should be extracted more completely from the "global"-capable system before doing this, but I suspect that'll happen pretty naturally as part of making this more plug-and-play. "local" and "global" are just two algorithms, one which doesn't use RPC.

codecov bot commented Jun 19, 2024

Codecov Report

Attention: Patch coverage is 68.67470% with 156 lines in your changes missing coverage. Please review.

Project coverage is 72.64%. Comparing base (03d9a2e) to head (2a3b361).
Report is 2 commits behind head on master.

Current head 2a3b361 differs from pull request most recent head 1f37531

Please upload reports for the commit 1f37531 to get more accurate results.

Additional details and impacted files
Files Coverage Δ
client/history/client.go 79.65% <100.00%> (+7.25%) ⬆️
common/quotas/collection.go 100.00% <ø> (ø)
...ommon/quotas/global/collection/internal/limiter.go 96.42% <100.00%> (-3.58%) ⬇️
common/quotas/multistageratelimiter.go 88.23% <100.00%> (ø)
common/types/mapper/proto/history.go 99.23% <100.00%> (+<0.01%) ⬆️
service/frontend/config/config.go 100.00% <100.00%> (ø)
client/history/peer_resolver.go 96.72% <92.59%> (-3.28%) ⬇️
common/dynamicconfig/filter.go 46.47% <0.00%> (-2.06%) ⬇️
common/quotas/global/rpc/error.go 25.00% <25.00%> (ø)
common/quotas/global/algorithm/requestweighted.go 94.48% <60.00%> (-5.52%) ⬇️
... and 10 more

... and 53 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 03d9a2e...1f37531. Read the comment docs.

coveralls commented Jun 19, 2024

Pull Request Test Coverage Report for Build 01902e4f-dc36-4239-92f4-af9b6ee1bc99

Details

  • 605 of 796 (76.01%) changed or added relevant lines in 27 files are covered.
  • 142 unchanged lines in 14 files lost coverage.
  • Overall coverage decreased (-0.03%) to 71.468%

Changes Missing Coverage (file | covered lines | changed/added lines | %)
client/history/peer_resolver.go 37 39 94.87%
common/dynamicconfig/config.go 11 13 84.62%
service/history/resource/resource.go 13 15 86.67%
common/quotas/global/rpc/error.go 1 4 25.0%
service/history/resource/resource_test_utils.go 0 3 0.0%
common/resource/resource_test_utils.go 0 5 0.0%
common/metrics/tags.go 0 6 0.0%
common/quotas/global/algorithm/requestweighted.go 17 25 68.0%
common/types/mapper/thrift/any.go 17 25 68.0%
service/history/handler/handler.go 24 34 70.59%
Files with Coverage Reduction (file | new missed lines | %)
service/history/queue/timer_queue_processor_base.go 1 77.66%
service/history/shard/context.go 2 79.13%
common/task/parallel_task_processor.go 2 93.06%
common/peerprovider/ringpopprovider/config.go 2 81.58%
common/quotas/global/collection/internal/limiter.go 2 97.56%
common/task/fifo_task_scheduler.go 2 85.57%
service/frontend/api/handler.go 2 75.62%
service/history/task/fetcher.go 3 85.57%
common/archiver/filestore/historyArchiver.go 4 80.95%
service/history/task/transfer_active_task_executor.go 4 72.77%
Totals Coverage Status
Change from base Build 01902d95-5925-47f2-80d9-47cee745b657: -0.03%
Covered Lines: 107043
Relevant Lines: 149777

💛 - Coveralls

coveralls commented Jun 20, 2024

Pull Request Test Coverage Report for Build 019037db-5969-4cb6-a83d-760851588f21

Details

  • 616 of 824 (74.76%) changed or added relevant lines in 27 files are covered.
  • 136 unchanged lines in 11 files lost coverage.
  • Overall coverage decreased (-0.02%) to 71.482%

Changes Missing Coverage (file | covered lines | changed/added lines | %)
common/dynamicconfig/config.go 11 13 84.62%
service/history/resource/resource.go 13 15 86.67%
common/quotas/global/rpc/error.go 1 4 25.0%
service/history/resource/resource_test_utils.go 0 3 0.0%
common/resource/resource_test_utils.go 0 5 0.0%
client/history/peer_resolver.go 38 44 86.36%
common/metrics/tags.go 0 6 0.0%
common/quotas/global/algorithm/requestweighted.go 22 30 73.33%
common/types/mapper/thrift/any.go 17 25 68.0%
service/history/handler/handler.go 24 34 70.59%
Files with Coverage Reduction (file | new missed lines | %)
common/task/weighted_round_robin_task_scheduler.go 2 88.56%
common/quotas/global/collection/internal/limiter.go 2 97.56%
common/persistence/historyManager.go 2 66.67%
service/history/task/task.go 3 84.81%
common/task/fifo_task_scheduler.go 3 84.54%
service/history/task/timer_standby_task_executor.go 3 85.63%
service/history/task/transfer_active_task_executor.go 4 72.77%
service/history/execution/cache.go 6 74.61%
service/history/execution/mutable_state_decision_task_manager.go 8 89.18%
host/testcluster.go 16 68.73%
Totals Coverage Status
Change from base Build 01902d95-5925-47f2-80d9-47cee745b657: -0.02%
Covered Lines: 107082
Relevant Lines: 149803

💛 - Coveralls

@Groxx Groxx changed the title from "testing in ci, ignore" to "Global ratelimiter: everything else" on Jun 24, 2024
coveralls commented Jun 24, 2024

Pull Request Test Coverage Report for Build 01904c1d-8ac5-47d2-8f14-b71d02715363

Details

  • 689 of 824 (83.62%) changed or added relevant lines in 27 files are covered.
  • 164 unchanged lines in 15 files lost coverage.
  • Overall coverage decreased (-0.001%) to 71.497%

Changes Missing Coverage (file | covered lines | changed/added lines | %)
common/dynamicconfig/config.go 11 13 84.62%
service/history/resource/resource.go 13 15 86.67%
common/quotas/global/rpc/error.go 1 4 25.0%
service/history/resource/resource_test_utils.go 0 3 0.0%
common/resource/resource_test_utils.go 0 5 0.0%
client/history/peer_resolver.go 38 44 86.36%
common/metrics/tags.go 0 6 0.0%
common/quotas/global/algorithm/requestweighted.go 22 30 73.33%
common/types/mapper/thrift/any.go 17 25 68.0%
service/history/handler/handler.go 24 34 70.59%
Files with Coverage Reduction (file | new missed lines | %)
common/task/weighted_round_robin_task_scheduler.go 2 88.56%
common/peerprovider/ringpopprovider/config.go 2 81.58%
service/matching/tasklist/task_list_manager.go 2 77.05%
common/quotas/global/collection/internal/limiter.go 2 97.56%
service/frontend/api/handler.go 2 75.62%
service/history/task/task.go 3 84.81%
service/history/task/timer_standby_task_executor.go 3 85.63%
tools/cli/admin_db_decode_thrift.go 3 69.23%
common/archiver/filestore/historyArchiver.go 4 80.95%
service/history/task/transfer_active_task_executor.go 4 72.77%
Totals Coverage Status
Change from base Build 01902d95-5925-47f2-80d9-47cee745b657: -0.001%
Covered Lines: 107105
Relevant Lines: 149803

💛 - Coveralls

@Groxx Groxx marked this pull request as ready for review June 24, 2024 22:12
"github.com/uber/cadence/common/types"
)

func TestMapping(t *testing.T) {
@davidporter-id-au davidporter-id-au (Contributor) commented Jun 27, 2024

Nit: What about round-trip tests?

@Groxx Groxx (Contributor, Author) commented Jun 27, 2024

they only map one way, no round trips involved.

more concretely: I believe our rpc-type-round-trip tests are for things like replication and queries, where the frontend both receives a request and sends that same request out to history/matching, possibly on a different protocol. that isn't happening in this code.

return &client{
history: historyClient,
resolver: resolver,
thisHost: uuid.NewString(), // TODO: would descriptive be better? but it works, unique ensures correctness.
@davidporter-id-au davidporter-id-au (Contributor) commented Jun 27, 2024

Not entirely sure what the value of a random string is there over a host/container instance name? I find this a little odd; do we not have a way to get runtime identity for our other clients?

I don't feel super strongly, but I would just think that you'd want consistency between service restarts (with the obv caveat that 'restart' doesn't make a lot of sense in a containerized world, but that's a different issue)

@Groxx Groxx (Contributor, Author) commented:

do we not have a way to get runtime identity for our other clients?

I'm honestly not sure. I couldn't find anything conclusive or even describing an attempt at a per-process unique ID.

I suspect there is one in the membership/peer/etc ringpop stuff, but I don't know the intricacies there well enough to figure out what it would be.

re container restarts: it is potentially fine, since the new process may receive the same pattern of requests it did before... but there really isn't any way to know if that's true in general.

though the consequences either way are quite minor, and look pretty similar whether the host-aggregated data is reused or lost. the only bit that's actually important is that two active hosts never choose the same value, because then their usage would be merged and the weights would not be fair.


that said: this can be changed at any time, since it'll also change on every deploy. so if we do find a nicer value, it'll be trivial to adopt.

Contributor commented:

To avoid confusion you can use the hostname that's available in the service resource object. Similar to the uuid, that hostname will change with restarts/deployments, which is fine; it's just a unique identifier of the service instance.
The existing implementation lacks debug logs, but when/if we introduce them it would be nice to see a hostname there instead of a random uuid.

@Groxx Groxx (Contributor, Author) commented Jun 27, 2024

I don't actually see that value being set anywhere at all 🤔 it's read, but never assigned.

If it's literally the host name, like os.Hostname(), that's not good enough because it would mean multiple instances on the same machine share the same name. That's somewhat common in a dockerized environment, as well as dev/CI.

@Groxx Groxx (Contributor, Author) commented:

to stick IRL chat in here:
gonna stick with UUID for now. but I agree something from the ring would probably be preferable, if we can find something that is truly unique and stable.

there should be something in there, our rings depend on it, but I'm not confident that it's host:port (though that is currently unique internally) instead of some other field we might not currently be exposing. if/when we find it, switching makes plenty of sense as it should be a more shared / identifiable value.

@davidporter-id-au davidporter-id-au (Contributor) left a comment

For users who're not particularly interested in this problem, who'll not attempt to roll out flipr config for the global rate-limit feature:

  • Are there any meaningful changes they should know about?
  • Directionally, can they just do nothing and it'll remain-as-is for them?

coveralls commented Jun 27, 2024

Pull Request Test Coverage Report for Build 01905ac0-75d9-412d-be38-63dbc29251ea

Details

  • 699 of 850 (82.24%) changed or added relevant lines in 29 files are covered.
  • 33 unchanged lines in 12 files lost coverage.
  • Overall coverage increased (+0.06%) to 71.491%

Changes Missing Coverage (file | covered lines | changed/added lines | %)
common/dynamicconfig/config.go 11 13 84.62%
service/history/resource/resource.go 13 15 86.67%
common/quotas/global/rpc/error.go 1 4 25.0%
service/history/resource/resource_test_utils.go 0 3 0.0%
common/resource/resource_test_utils.go 0 5 0.0%
common/log/tag/tags.go 9 15 60.0%
common/metrics/tags.go 3 9 33.33%
common/quotas/global/shared/keymapper.go 8 14 57.14%
client/history/peer_resolver.go 38 46 82.61%
common/quotas/global/algorithm/requestweighted.go 22 30 73.33%
Files with Coverage Reduction (file | new missed lines | %)
common/task/weighted_round_robin_task_scheduler.go 1 89.05%
client/history/peer_resolver.go 1 89.89%
service/history/task/transfer_standby_task_executor.go 2 86.23%
common/cache/lru.go 2 93.01%
common/quotas/global/collection/internal/limiter.go 2 97.56%
common/task/fifo_task_scheduler.go 2 87.63%
service/frontend/api/handler.go 2 75.62%
common/membership/hashring.go 2 84.69%
service/history/handler/handler.go 3 95.65%
common/persistence/statsComputer.go 3 98.18%
Totals Coverage Status
Change from base Build 0190573d-ff12-4850-94f0-8c77deb099df: 0.06%
Covered Lines: 105261
Relevant Lines: 147236

💛 - Coveralls

coveralls commented Jun 27, 2024

Pull Request Test Coverage Report for Build 01905aec-2f03-4488-81f1-7aff8cdf3c00

Details

  • 688 of 852 (80.75%) changed or added relevant lines in 29 files are covered.
  • 43 unchanged lines in 14 files lost coverage.
  • Overall coverage increased (+0.06%) to 71.49%

Changes Missing Coverage (file | covered lines | changed/added lines | %)
common/dynamicconfig/config.go 11 13 84.62%
service/history/resource/resource.go 13 15 86.67%
common/quotas/global/rpc/error.go 1 4 25.0%
service/history/resource/resource_test_utils.go 0 3 0.0%
common/resource/resource_test_utils.go 0 5 0.0%
common/metrics/tags.go 3 9 33.33%
common/quotas/global/shared/keymapper.go 8 14 57.14%
client/history/peer_resolver.go 38 46 82.61%
common/quotas/global/algorithm/requestweighted.go 22 30 73.33%
common/types/mapper/thrift/any.go 17 25 68.0%
Files with Coverage Reduction (file | new missed lines | %)
common/task/weighted_round_robin_task_scheduler.go 1 89.05%
client/history/peer_resolver.go 1 89.89%
service/history/task/transfer_standby_task_executor.go 2 86.23%
common/mapq/types/policy_collection.go 2 93.06%
common/cache/lru.go 2 93.01%
common/quotas/global/collection/internal/limiter.go 2 97.56%
service/frontend/api/handler.go 2 75.74%
common/persistence/historyManager.go 2 66.67%
service/history/handler/handler.go 3 95.65%
service/history/task/transfer_active_task_executor.go 3 71.09%
Totals Coverage Status
Change from base Build 0190573d-ff12-4850-94f0-8c77deb099df: 0.06%
Covered Lines: 105261
Relevant Lines: 147238

💛 - Coveralls

@Groxx Groxx (Contributor, Author) commented Jun 27, 2024

@davidporter-id-au For users who're not particularly interested in this problem, who'll not attempt to roll out flipr config for the global rate-limit feature:

  • Are there any meaningful changes they should know about?
  • Directionally, can they just do nothing and it'll remain-as-is for them?

Added deployment steps near the top of the commit message. Look good?

Groxx added a commit to cadence-workflow/cadence-idl that referenced this pull request Jun 27, 2024 (cadence-idl#172):

Mostly-prerequisite for the final major step of building the global ratelimiter system in cadence-workflow/cadence#6141

This Thrift addition _does not_ need to be done, the system could instead exchange Protobuf / gob / JSON data.  But I've done it in Thrift because:
1. We already use Thrift rather heavily in service code, for long-term-stable data, like many database types.
2. We do not use Protobuf like ^ this _anywhere_.  This PR could begin to change that, but I feel like that has some larger ramifications to discuss before leaping for it.
3. Gob is _significantly_ larger than Thrift, and no more human-readable than Thrift or Protobuf, and it doesn't offer quite as strong protection against unintended changes (IDL files/codegen make that "must be stable" contract very explicit).

Notes otherwise include:
- i32 because more than 2 million operations within an update cycle (~3s planned) on a single host is roughly 1,000x beyond the size of ALL of our current ratelimits, and it uses half of the space of an i64.
  - To avoid roll-around issues even if this happens, the service code saturates at max-i32 rather than rolling around.  We'll just lose precise weight information across beyond-2m hosts if that happens.
- `double` is returned because it's scale-agnostic and not particularly worth squeezing further, and it allows the aggregator to be completely target-RPS-agnostic (it doesn't need limiters or that dynamic config _at all_ as it just tracks weight).
  - This could be adjusted to a... pair of ints?  Local/global RPS used, so callers can determine their weight locally?  I'm not sure if that'd be clearer or more useful, but it's an option, especially since I don't think we care about accurately tracking <1RPS (so ints are fine).
- If we decide we care a lot about data size, key strings are by far the majority of the bytes.  There are a lot of key-compaction options (most simply: a map per collection name), we could experiment a bit.

And last but not least, if we change our mind and want to move away from Thrift here:
we just need to make a new `any.ValueType` string to identify that new format, and maintain this thrift impl for as long as we want to allow transparent server upgrades.  And when we remove it, if someone still hasn't upgraded yet, they'll just fall back to local-only behavior (which is what we have used for the past several years) until the deploy finishes.  Risk is extremely low.
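The "saturate at max-i32" note above is easy to picture with a tiny sketch (illustrative only, not the actual cadence code); only the increment direction matters here, since the counters only grow within an update cycle.

```go
package main

import (
	"fmt"
	"math"
)

// saturatingAddInt32 clamps at math.MaxInt32 instead of wrapping to a negative
// number, matching the "saturate rather than roll around" behavior described.
func saturatingAddInt32(total, delta int32) int32 {
	if delta > 0 && total > math.MaxInt32-delta {
		return math.MaxInt32
	}
	return total + delta
}

func main() {
	fmt.Println(saturatingAddInt32(math.MaxInt32-1, 5)) // 2147483647, not a wrapped negative value
}
```
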
coveralls commented Jun 27, 2024

Pull Request Test Coverage Report for Build 01905b7c-fcd3-46a3-9e31-6318be207dbc

Details

  • 684 of 848 (80.66%) changed or added relevant lines in 29 files are covered.
  • 21 unchanged lines in 10 files lost coverage.
  • Overall coverage increased (+0.07%) to 71.502%

Changes Missing Coverage (file | covered lines | changed/added lines | %)
common/dynamicconfig/config.go 11 13 84.62%
service/history/resource/resource.go 13 15 86.67%
common/quotas/global/rpc/error.go 1 4 25.0%
service/history/resource/resource_test_utils.go 0 3 0.0%
common/resource/resource_test_utils.go 0 5 0.0%
common/metrics/tags.go 3 9 33.33%
common/quotas/global/shared/keymapper.go 8 14 57.14%
client/history/peer_resolver.go 38 46 82.61%
common/quotas/global/algorithm/requestweighted.go 22 30 73.33%
common/types/mapper/thrift/any.go 17 25 68.0%
Files with Coverage Reduction (file | new missed lines | %)
common/task/weighted_round_robin_task_scheduler.go 1 89.05%
client/history/peer_resolver.go 1 89.89%
service/history/task/transfer_standby_task_executor.go 2 86.64%
common/cache/lru.go 2 93.01%
common/quotas/global/collection/internal/limiter.go 2 97.37%
common/task/fifo_task_scheduler.go 2 85.57%
service/frontend/api/handler.go 2 75.62%
service/history/task/transfer_active_task_executor.go 2 71.17%
common/persistence/statsComputer.go 3 98.18%
common/archiver/filestore/historyArchiver.go 4 80.95%
Totals Coverage Status
Change from base Build 0190573d-ff12-4850-94f0-8c77deb099df: 0.07%
Covered Lines: 105274
Relevant Lines: 147232

💛 - Coveralls

func (s *configSuite) SetupSuite() {
func (s *configSuite) SetupTest() {
@Groxx Groxx (Contributor, Author) commented:

key changes were leaking between tests; that doesn't seem desired in any of these, so now each test gets a new value.
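For context on why that fixes the leak (a generic illustration, not the actual cadence config test): testify runs SetupSuite once per suite but SetupTest before every test, so per-test state like these keys belongs in SetupTest.

```go
package example

import (
	"testing"

	"github.com/google/uuid"
	"github.com/stretchr/testify/suite"
)

type exampleSuite struct {
	suite.Suite
	key string // must not leak between tests
}

// SetupTest runs before every test, so each test sees a fresh key; putting
// this in SetupSuite would share one value across the whole suite.
func (s *exampleSuite) SetupTest() {
	s.key = uuid.NewString()
}

func (s *exampleSuite) TestKeyIsFresh() {
	s.NotEmpty(s.key)
}

func TestExampleSuite(t *testing.T) {
	suite.Run(t, new(exampleSuite))
}
```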

coveralls commented Jun 27, 2024

Pull Request Test Coverage Report for Build 01905bc1-d3c5-43ef-b8b1-17df18d6a6da

Details

  • 686 of 848 (80.9%) changed or added relevant lines in 29 files are covered.
  • 19 unchanged lines in 7 files lost coverage.
  • Overall coverage increased (+0.09%) to 71.52%

Changes Missing Coverage (file | covered lines | changed/added lines | %)
service/history/resource/resource.go 13 15 86.67%
common/quotas/global/rpc/error.go 1 4 25.0%
service/history/resource/resource_test_utils.go 0 3 0.0%
common/resource/resource_test_utils.go 0 5 0.0%
common/metrics/tags.go 3 9 33.33%
common/quotas/global/shared/keymapper.go 8 14 57.14%
client/history/peer_resolver.go 38 46 82.61%
common/quotas/global/algorithm/requestweighted.go 22 30 73.33%
common/types/mapper/thrift/any.go 17 25 68.0%
service/history/handler/handler.go 20 30 66.67%
Files with Coverage Reduction (file | new missed lines | %)
common/task/weighted_round_robin_task_scheduler.go 1 89.05%
client/history/peer_resolver.go 1 89.89%
common/cache/lru.go 2 93.01%
service/matching/tasklist/task_list_manager.go 2 76.65%
common/quotas/global/collection/internal/limiter.go 2 97.37%
common/task/fifo_task_scheduler.go 2 85.57%
service/history/shard/context.go 9 78.13%
Totals Coverage Status
Change from base Build 0190573d-ff12-4850-94f0-8c77deb099df: 0.09%
Covered Lines: 105301
Relevant Lines: 147232

💛 - Coveralls

// unknown filter string is likely safe to change and then should be updated here, but otherwise this ensures the logic isn't entirely position-dependent.
require.Equalf(t, "unknownFilter", filterString, "expected first filter to be 'unknownFilter', but it was %v", filterString)
} else {
assert.NotEqualf(t, UnknownFilter, ParseFilter(filterString), "failed to parse filter: %s, make sure it is in ParseFilter's switch statement", filterString)
Contributor commented:

this is a weak validation that only checks it's not parsed as UnknownFilter, but it's better than nothing

@Groxx Groxx (Contributor, Author) commented Jun 28, 2024

a test that checks that the mapping is what we expect would just be a re-implementation of the func itself :\ though I can check that it's unique, since nothing enforces that currently.

I think this would all probably be better done with a map rather than a slice, so the pairing can be built up in a single hardcoded location, I just kinda don't want to make major changes here in this PR.
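A minimal sketch of the map-based idea (hypothetical, not the current cadence filter.go; only UnknownFilter/"unknownFilter" are taken from the test above, the rest is illustrative): keeping each Filter and its string form in one map means parsing, printing, and a uniqueness check all read from a single hardcoded pairing.

```go
package dynamicconfig

type Filter int

const (
	UnknownFilter Filter = iota
	DomainName // illustrative entry; the real filter list is longer
)

// filterStrings is the single source of truth for the Filter <-> string pairing.
var filterStrings = map[Filter]string{
	UnknownFilter: "unknownFilter",
	DomainName:    "domainName",
}

func (f Filter) String() string {
	if s, ok := filterStrings[f]; ok {
		return s
	}
	return filterStrings[UnknownFilter]
}

// ParseFilter is the reverse lookup; anything unknown maps to UnknownFilter.
func ParseFilter(s string) Filter {
	for f, name := range filterStrings {
		if name == s {
			return f
		}
	}
	return UnknownFilter
}
```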

@Groxx Groxx (Contributor, Author) commented:

added a uniqueness check too

Contributor commented:

that's fair. without changing filters into a map I don't see a future-proof way; it can be handled separately since this PR is quite big already.

@Groxx Groxx enabled auto-merge (squash) June 28, 2024 21:45
@Groxx Groxx merged commit 239767d into cadence-workflow:master Jun 28, 2024
18 of 19 checks passed
coveralls commented Jun 28, 2024

Pull Request Test Coverage Report for Build 019060c6-2bde-406c-bf4d-55ee12eb19fe

Details

  • 688 of 850 (80.94%) changed or added relevant lines in 29 files are covered.
  • 26 unchanged lines in 9 files lost coverage.
  • Overall coverage increased (+0.06%) to 71.49%

Changes Missing Coverage (file | covered lines | changed/added lines | %)
service/history/resource/resource.go 13 15 86.67%
common/quotas/global/rpc/error.go 1 4 25.0%
service/history/resource/resource_test_utils.go 0 3 0.0%
common/resource/resource_test_utils.go 0 5 0.0%
common/metrics/tags.go 3 9 33.33%
common/quotas/global/shared/keymapper.go 8 14 57.14%
client/history/peer_resolver.go 38 46 82.61%
common/quotas/global/algorithm/requestweighted.go 22 30 73.33%
common/types/mapper/thrift/any.go 17 25 68.0%
service/history/handler/handler.go 20 30 66.67%
Files with Coverage Reduction (file | new missed lines | %)
common/task/weighted_round_robin_task_scheduler.go 1 89.05%
client/history/peer_resolver.go 1 89.89%
service/history/task/transfer_standby_task_executor.go 2 86.84%
common/cache/lru.go 2 93.01%
common/quotas/global/collection/internal/limiter.go 2 97.37%
service/matching/tasklist/task_list_manager.go 3 76.45%
common/task/fifo_task_scheduler.go 3 84.54%
common/persistence/statsComputer.go 3 98.18%
service/history/shard/context.go 9 78.13%
Totals Coverage Status
Change from base Build 0190573d-ff12-4850-94f0-8c77deb099df: 0.06%
Covered Lines: 105257
Relevant Lines: 147234

💛 - Coveralls

timl3136 pushed a commit to timl3136/cadence-idl that referenced this pull request Jul 16, 2024 (cadence-workflow#172)