Global ratelimiter: everything else #6141

Groxx · 2024-06-19T00:24:36Z

After too many attempts to break this apart and build different portions in self-contained ways, and running into various inter-dependent roadblocks... I just gave up and did it all at once.

Rollout plan for people who don't want or need this system

Do nothing :)

As of this PR, you'll use "disabled" and that should be as close to "no changes at all" as possible.
Soon, you'll get "local", and then you'll have some new metrics you can use (or ignore) but otherwise no behavior changes.

And that'll be it. The "global" load-balanced stuff is likely to remain opt-in.

Rollout plan for us

For deployment: any order is fine / should not behave (too) badly, even if "global" or either shadow mode is selected on the initial deploy. Frontends will have background RatelimitUpdate request failures until History is deployed, but that just means they continue to use the "local" internal fallback, which is in practice the same behavior as "local" or "disabled", just slightly noisier.

The smoothest deployment is: deploy everything on "disabled" or "local" (the default(s), so no requests are sent until deploy is done), then switch to "local-shadow-global" to warm global limiters / check that it's working, then "global" to use the global behavior.

Rolling back is just the opposite. Ideally disable things first to stop the requests, but even if you don't it should be fine.

In more detail:

At merge time, this will set the "key mode" (frontend.globalRatelimiterMode) to "disabled", which gets as close as is reasonably possible to acting exactly like it did before this PR.
- This is also effectively the panic button for the initial rollout.
Once that proves to not immediately explode, switch to "local" for all keys. This will keep the current ratelimiter rates, but will start collecting and emitting ratelimiter-usage metrics, so we can make sure that doesn't explode either (and update dashboards, etc).
- "local" will eventually become the new default, and I'll remove "disabled": the behavior is the same, but I think we'll want to keep the metrics.
Probably switch everything over to "local-shadow-global" so we start using the global system and emitting its metrics too, so we can make sure it doesn't seem like it'll explode / be surprisingly worse / etc.
- pprof it to make sure running costs are in expected bounds
Start switching individual domains over to "global" and lowering their RPS back to where we intend, rather than their current artificially-raised-to-mitigate-load-imbalance values.
- This is done by making frontend.globalRatelimiterMode return "global" for keys like .*:my-domain (to catch user:my-domain, worker:my-domain, etc).
- In the built-in dynamic configs, this looks like: constraints: {ratelimitKey: "user:my-domain"}
If all goes well, we'll probably switch everyone over to "global" soonish, and we can retain "local" for edge cases that we didn't expect, where the old behavior works better.

The changes in a nutshell

(... I guess it's a coconut, given the size)

This PR includes:

Four separate "collection"s, which match the previous quotas.Collection usage (and are used as a drop-in replacement, though this needed a change to use interfaces).
- This means we have four concurrent update cycles per limiting/frontend host, but they all share aggregating/history collections (which is fine because shared.GlobalKeys are namespaced).
A dynamic config flag to control which "mode" a key is in: disabled (old code fallthrough), local (old code with metrics), global, or x-shadow-y to use x while shadowing y.
- These can be changed at any time and do not need restarts/etc to take effect. Old data will be cleaned up when changing modes, but the "collection" itself does not actually stop in any mode, it just effectively no-ops as needed.
- This operates on collection-name-prefixed keys (shared.GlobalKey), so in practice we will see things like user:domain <- this is the limiter for "user" requests for that domain, e.g. StartWorkflow RPS. This allows us to roll this out per domain (suffix matches or just compare against all 4 values) and/or per type (prefix matches), so we can adjust to surprises reasonably precisely.
Switched quotas-related code to the new clock.Ratelimiter APIs as much as possible, which allows some simple wrappers and sharing more logic with other quotas package code.
Added rather quick Limiter-side garbage collection after realizing some issues with weights going super low, and it also seems like a good idea to keep data usage low in the system in general.
- This is a semantic change over previous behavior, but seems important to have in v1.
A couple simple thrift types to keep the data I send through this system compact (Global ratelimiter part 3: compact request-weighted-algorithm data cadence-idl#172)
A PeerResolver addition to split a slice of strings into the keys-per-host that the associated data should be sent to, and a new type to make it clearer that "this is an RPC peer, not a string"
- And exposing this a few more places to get it into the RPC package, so it can choose which hosts to contact.
Several new metrics/logs/dynamicconfig pieces, to monitor and control all this.
Bundled the RequestWeighted arguments into a struct so it's easier to keep encoding and decoding together, and pass it blindly between the two pieces of code.
- Initially I wanted to keep all RPC-type details internal to the rpc package, and that drives some of this setup, but I'm pretty sure that doesn't make sense with a full plugin-friendly system. So this will almost certainly be moved later.
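
As an illustration of the PeerResolver addition listed above (splitting keys into per-host groups, with a dedicated peer type), here is a minimal sketch. The names `Peer`, `lookupFn`, `splitByPeer`, and `modLookup` are assumptions for this example and are not the signatures added in this PR; the modulo-based lookup is a toy stand-in for the real hash ring.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Peer is a distinct type so an RPC peer address cannot be confused with an
// arbitrary string. The name is assumed for this sketch.
type Peer string

// lookupFn stands in for a hash-ring lookup. Hypothetical signature.
type lookupFn func(key string) (Peer, error)

// splitByPeer groups keys by the peer that owns each one, so a caller can
// send one request per host containing only that host's keys.
func splitByPeer(keys []string, lookup lookupFn) (map[Peer][]string, error) {
	out := make(map[Peer][]string)
	for _, k := range keys {
		p, err := lookup(k)
		if err != nil {
			return nil, fmt.Errorf("looking up peer for key %q: %w", k, err)
		}
		out[p] = append(out[p], k)
	}
	return out, nil
}

// modLookup is a toy replacement for a ring: hash the key and pick a peer by
// modulo. A real ring would use consistent hashing instead.
func modLookup(peers []Peer) lookupFn {
	return func(key string) (Peer, error) {
		if len(peers) == 0 {
			return "", fmt.Errorf("no peers available")
		}
		h := fnv.New32a()
		h.Write([]byte(key))
		return peers[h.Sum32()%uint32(len(peers))], nil
	}
}

func main() {
	peers := []Peer{"10.0.0.1:7833", "10.0.0.2:7833"}
	keys := []string{"user:a", "worker:a", "user:b", "worker:b"}
	byPeer, err := splitByPeer(keys, modLookup(peers))
	if err != nil {
		panic(err)
	}
	for p, ks := range byPeer {
		fmt.Println(p, ks)
	}
}
```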

Testing

Aside from the unit tests here, I've locally run all this with the new development_instance2.yaml file, made some domains / sent some requests, watched where requests went / how weights changed / when GC occurred / etc. After some bug fixes and the "GC locally after 5 idle periods" change, it seems to be doing exactly what I want it to do, including adjusting as I start and stop the extra instance(s).

I would like to build a multi-instance cluster test (or a docker-compose.yaml at the very least) for a variety of kinds of tests, but I wasn't able to find anything that looked promising to build off, and I didn't want to spend a week figuring one out from scratch :\ I'm open to trying if someone has concrete ideas though.

Future changes, roughly in priority order

High-level docs are not yet updated.
- This should be done before a release / encouraging its use publicly.
Currently "insufficient data" and "low total rps usage" are not handled well in this system.
- "Insufficient data" almost certainly deserves to be handled, otherwise after a ring change the first host to call RatelimitUpdate for a migrated key will receive all the weight, which is both unfair and may allow exceeding the target RPS. Having aggregators not return data until [update interval] or similar has passed since the first update may be enough to resolve this.
- "Low RPS" currently has some surprising edge cases like very very low weights (if more zero periods than used periods) and being less than ideal when a burst of requests occurs. Low weights seems important to resolve and may involve just preventing average RPS from dipping below 1 (or similar), and bursts could be improved by allowing hosts to use some of the "free" RPS until the next update (but we are not yet sure if we want to allow this).
"disabled" mode is basically a temporary safety fallback, and it should be removed.
- "local" has better monitoring and does garbage-collection and is probably preferable in ~all cases.
I am not confident that these metrics/logs will give us all the observability we want, so I anticipate some changes / additions / etc.
- Currently we have no metrics "directly" on limiters, so all existing "request was ratelimited" data is based on externally-visible behavior and will not change at all. So this PR should strictly be no worse than our existing monitoring, but I do not really think it is good enough yet.
There are a couple changes I'd like to make to third-party libraries:
- golang.org/x/time/rate needs a PR for its flawed locking.
  - If/when that is accepted, clock.Ratelimiter likely should not change at all. x/time/rate will likely still allow time-rewinding, and we'll still need to wrap it and control reservation.CancelAt calls / for mocking, and that'll need essentially everything that's currently there.
- github.com/jonboulle/clockwork is tough to use with time.Tickers and contexts, and that seems fix-able.
  - Adding a ticker.OneShotChan() API would let us know when a "receive tick -> do something -> go back to waiting" cycle completes, rather than having to sleep and hope it's long enough. Currently we have no real way to work around this.
  - clock.WithTimeout(ctx, dur) and similar seems rather obviously needed in retrospect, LOTS of time-based stuff uses context timeouts. I have a prototype built but I'm not confident that it's "good enough" to serve as a precise replacement, and we'd need to do something to ensure prod costs are either low enough to accept, or start using build tags to exclude it from prod entirely.
Adding a custom membership.Peer arg to history/client.RatelimitUpdate seems ideal, and is hopefully not too difficult.
This code is not fully plug-and-play capable right now. Allowing internally-implemented algorithms / multiple algorithms / etc to be added and dynamically selected will need some more medium-small-ish work to come up with those general structures, plus a dynamic config structure to control it.
- This will almost certainly happen, unless we somehow decide this is perfect in v1.
- At a very high level, this is just "keep a list of registered algorithms and collections, dispatch by the Any-data's ValueType", and some changes to the rpc package to make it generic.
- "local" should be extracted more completely from the "global"-capable system before doing this, but I suspect that'll happen pretty naturally as part of making this more plug-and-play. "local" and "global" are just two algorithms, one which doesn't use RPC.
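
The "keep a list of registered algorithms and collections, dispatch by the Any-data's ValueType" idea could look roughly like the sketch below. Everything here is hypothetical (the `Any` field names, `Handler`, `Registry`, and the `"example-weights-v1"` type string); it only shows the shape of the dispatch, not a design this PR commits to.

```go
package main

import "fmt"

// Any is a stand-in for the opaque RPC payload: a type tag plus bytes.
// Field names are assumptions for illustration.
type Any struct {
	ValueType string
	Value     []byte
}

// Handler processes one kind of payload and returns a response payload.
type Handler func(payload []byte) (Any, error)

// Registry maps ValueType strings to the algorithm that understands them.
// Not safe for concurrent registration; register everything at startup.
type Registry struct {
	handlers map[string]Handler
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{handlers: make(map[string]Handler)}
}

// Register adds a handler, rejecting duplicates so two algorithms can never
// claim the same ValueType.
func (r *Registry) Register(valueType string, h Handler) error {
	if _, ok := r.handlers[valueType]; ok {
		return fmt.Errorf("value type %q already registered", valueType)
	}
	r.handlers[valueType] = h
	return nil
}

// Dispatch routes a payload to its handler. An unknown type is an error,
// which a caller could treat as "fall back to local behavior".
func (r *Registry) Dispatch(in Any) (Any, error) {
	h, ok := r.handlers[in.ValueType]
	if !ok {
		return Any{}, fmt.Errorf("unrecognized value type %q", in.ValueType)
	}
	return h(in.Value)
}

func main() {
	reg := NewRegistry()
	// "example-weights-v1" is a made-up type string for this sketch.
	err := reg.Register("example-weights-v1", func(p []byte) (Any, error) {
		return Any{ValueType: "example-weights-v1", Value: p}, nil
	})
	if err != nil {
		panic(err)
	}
	out, err := reg.Dispatch(Any{ValueType: "example-weights-v1", Value: []byte("data")})
	fmt.Println(out.ValueType, string(out.Value), err)

	_, err = reg.Dispatch(Any{ValueType: "unknown"})
	fmt.Println(err)
}
```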

codecov · 2024-06-19T00:39:56Z

Codecov Report

Attention: Patch coverage is 68.67470% with 156 lines in your changes missing coverage. Please review.

Project coverage is 72.64%. Comparing base (03d9a2e) to head (2a3b361).
Report is 2 commits behind head on master.

❗ Current head 2a3b361 differs from pull request most recent head 1f37531

Please upload reports for the commit 1f37531 to get more accurate results.

Additional details and impacted files

Files	Coverage Δ
client/history/client.go	`79.65% <100.00%> (+7.25%)`	⬆️
common/quotas/collection.go	`100.00% <ø> (ø)`
...ommon/quotas/global/collection/internal/limiter.go	`96.42% <100.00%> (-3.58%)`	⬇️
common/quotas/multistageratelimiter.go	`88.23% <100.00%> (ø)`
common/types/mapper/proto/history.go	`99.23% <100.00%> (+<0.01%)`	⬆️
service/frontend/config/config.go	`100.00% <100.00%> (ø)`
client/history/peer_resolver.go	`96.72% <92.59%> (-3.28%)`	⬇️
common/dynamicconfig/filter.go	`46.47% <0.00%> (-2.06%)`	⬇️
common/quotas/global/rpc/error.go	`25.00% <25.00%> (ø)`
common/quotas/global/algorithm/requestweighted.go	`94.48% <60.00%> (-5.52%)`	⬇️
... and 10 more

... and 53 files with indirect coverage changes

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

coveralls · 2024-06-19T02:57:53Z

Pull Request Test Coverage Report for Build 01902e4f-dc36-4239-92f4-af9b6ee1bc99

Details

605 of 796 (76.01%) changed or added relevant lines in 27 files are covered.
142 unchanged lines in 14 files lost coverage.
Overall coverage decreased (-0.03%) to 71.468%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
client/history/peer_resolver.go	37	39	94.87%
common/dynamicconfig/config.go	11	13	84.62%
service/history/resource/resource.go	13	15	86.67%
common/quotas/global/rpc/error.go	1	4	25.0%
service/history/resource/resource_test_utils.go	0	3	0.0%
common/resource/resource_test_utils.go	0	5	0.0%
common/metrics/tags.go	0	6	0.0%
common/quotas/global/algorithm/requestweighted.go	17	25	68.0%
common/types/mapper/thrift/any.go	17	25	68.0%
service/history/handler/handler.go	24	34	70.59%

Files with Coverage Reduction	New Missed Lines	%
service/history/queue/timer_queue_processor_base.go	1	77.66%
service/history/shard/context.go	2	79.13%
common/task/parallel_task_processor.go	2	93.06%
common/peerprovider/ringpopprovider/config.go	2	81.58%
common/quotas/global/collection/internal/limiter.go	2	97.56%
common/task/fifo_task_scheduler.go	2	85.57%
service/frontend/api/handler.go	2	75.62%
service/history/task/fetcher.go	3	85.57%
common/archiver/filestore/historyArchiver.go	4	80.95%
service/history/task/transfer_active_task_executor.go	4	72.77%

Totals
Change from base Build 01902d95-5925-47f2-80d9-47cee745b657:	-0.03%
Covered Lines:	107043
Relevant Lines:	149777

💛 - Coveralls

coveralls · 2024-06-20T23:26:57Z

Pull Request Test Coverage Report for Build 019037db-5969-4cb6-a83d-760851588f21

Details

616 of 824 (74.76%) changed or added relevant lines in 27 files are covered.
136 unchanged lines in 11 files lost coverage.
Overall coverage decreased (-0.02%) to 71.482%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
common/dynamicconfig/config.go	11	13	84.62%
service/history/resource/resource.go	13	15	86.67%
common/quotas/global/rpc/error.go	1	4	25.0%
service/history/resource/resource_test_utils.go	0	3	0.0%
common/resource/resource_test_utils.go	0	5	0.0%
client/history/peer_resolver.go	38	44	86.36%
common/metrics/tags.go	0	6	0.0%
common/quotas/global/algorithm/requestweighted.go	22	30	73.33%
common/types/mapper/thrift/any.go	17	25	68.0%
service/history/handler/handler.go	24	34	70.59%

Files with Coverage Reduction	New Missed Lines	%
common/task/weighted_round_robin_task_scheduler.go	2	88.56%
common/quotas/global/collection/internal/limiter.go	2	97.56%
common/persistence/historyManager.go	2	66.67%
service/history/task/task.go	3	84.81%
common/task/fifo_task_scheduler.go	3	84.54%
service/history/task/timer_standby_task_executor.go	3	85.63%
service/history/task/transfer_active_task_executor.go	4	72.77%
service/history/execution/cache.go	6	74.61%
service/history/execution/mutable_state_decision_task_manager.go	8	89.18%
host/testcluster.go	16	68.73%

Totals
Change from base Build 01902d95-5925-47f2-80d9-47cee745b657:	-0.02%
Covered Lines:	107082
Relevant Lines:	149803

coveralls · 2024-06-24T21:51:07Z

Pull Request Test Coverage Report for Build 01904c1d-8ac5-47d2-8f14-b71d02715363

Details

689 of 824 (83.62%) changed or added relevant lines in 27 files are covered.
164 unchanged lines in 15 files lost coverage.
Overall coverage decreased (-0.001%) to 71.497%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
common/dynamicconfig/config.go	11	13	84.62%
service/history/resource/resource.go	13	15	86.67%
common/quotas/global/rpc/error.go	1	4	25.0%
service/history/resource/resource_test_utils.go	0	3	0.0%
common/resource/resource_test_utils.go	0	5	0.0%
client/history/peer_resolver.go	38	44	86.36%
common/metrics/tags.go	0	6	0.0%
common/quotas/global/algorithm/requestweighted.go	22	30	73.33%
common/types/mapper/thrift/any.go	17	25	68.0%
service/history/handler/handler.go	24	34	70.59%

Files with Coverage Reduction	New Missed Lines	%
common/task/weighted_round_robin_task_scheduler.go	2	88.56%
common/peerprovider/ringpopprovider/config.go	2	81.58%
service/matching/tasklist/task_list_manager.go	2	77.05%
common/quotas/global/collection/internal/limiter.go	2	97.56%
service/frontend/api/handler.go	2	75.62%
service/history/task/task.go	3	84.81%
service/history/task/timer_standby_task_executor.go	3	85.63%
tools/cli/admin_db_decode_thrift.go	3	69.23%
common/archiver/filestore/historyArchiver.go	4	80.95%
service/history/task/transfer_active_task_executor.go	4	72.77%

Totals
Change from base Build 01902d95-5925-47f2-80d9-47cee745b657:	-0.001%
Covered Lines:	107105
Relevant Lines:	149803

davidporter-id-au · 2024-06-27T01:19:45Z

common/quotas/global/rpc/mapping_test.go

+	"github.com/uber/cadence/common/types"
+)
+
+func TestMapping(t *testing.T) {


Nit: What about round-trip tests?

they only map one way, no round trips involved.

more concretely: I believe our rpc-type-round-trip tests are for things like replication and queries, where the frontend both receives a request and sends that same request out to history/matching, possibly on a different protocol. that isn't happening in this code.

davidporter-id-au · 2024-06-27T01:27:14Z

common/quotas/global/rpc/client.go

+	return &client{
+		history:  historyClient,
+		resolver: resolver,
+		thisHost: uuid.NewString(), // TODO: would descriptive be better?  but it works, unique ensures correctness.


Not entirely sure what the value of a random string is there over a host/container instance name? I find this a little odd, do we not have a way to get runtime identity for our other clients?

I don't feel super strongly, but I would just think that you'd want consistency between service restarts (with the obv caveat that 'restart' doesn't make a lot of sense in a containerized world, but that's a different issue)

do we not have a way to get runtime identity for our other clients?

I'm honestly not sure. I couldn't find anything conclusive or even describing an attempt at a per-process unique ID.

I suspect there is one in the membership/peer/etc ringpop stuff, but I don't know the intricacies there well enough to figure out what it would be.

re container restarts: reusing the identity is potentially fine, since the new process may well receive the same pattern of requests it did before...... but there really isn't any way to know if that's true in general.

though the consequences either way are quite minor, and look pretty similar whether the host-aggregated-data is reused or lost. the only bit that's actually important is that two active hosts never choose the same value, because they won't weight fairly.

that said: this can be changed at any time, since it'll also change on every deploy. so if we do find a nicer value, it'll be trivial to adopt.

To avoid confusion you can use the hostname that's available in service resource object. Similar to uuid that hostname will change with restarts/deployments which is fine. it's just a unique identifier of the service instance.
Existing implementation lacks debug logs but when/if we introduce them it would be nice to avoid a random uuid and see hostname there.

I don't actually see that value being set anywhere at all 🤔 it's read, but never assigned.

If it's literally the host name, like os.Hostname(), that's not good enough because it would mean multiple instances on the same machine share the same name. That's somewhat common in a dockerized environment, as well as dev/CI.

to stick IRL chat in here:
gonna stick with UUID for now. but I agree something from the ring would probably be preferable, if we can find something that is truly unique and stable.

there should be something in there, our rings depend on it, but I'm not confident that it's host:port (though that is currently unique internally) instead of some other field we might not currently be exposing. if/when we find it, switching makes plenty of sense as it should be a more shared / identifiable value.

davidporter-id-au

For users who're not particularly interested in this problem, who'll not attempt to roll out flipr config for the global rate-limit feature:

- Are there any meaningful changes they should know about?
- Directionally, can they just do nothing and it'll remain-as-is for them?

coveralls · 2024-06-27T18:04:48Z

Pull Request Test Coverage Report for Build 01905ac0-75d9-412d-be38-63dbc29251ea

Details

699 of 850 (82.24%) changed or added relevant lines in 29 files are covered.
33 unchanged lines in 12 files lost coverage.
Overall coverage increased (+0.06%) to 71.491%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
common/dynamicconfig/config.go	11	13	84.62%
service/history/resource/resource.go	13	15	86.67%
common/quotas/global/rpc/error.go	1	4	25.0%
service/history/resource/resource_test_utils.go	0	3	0.0%
common/resource/resource_test_utils.go	0	5	0.0%
common/log/tag/tags.go	9	15	60.0%
common/metrics/tags.go	3	9	33.33%
common/quotas/global/shared/keymapper.go	8	14	57.14%
client/history/peer_resolver.go	38	46	82.61%
common/quotas/global/algorithm/requestweighted.go	22	30	73.33%

Files with Coverage Reduction	New Missed Lines	%
common/task/weighted_round_robin_task_scheduler.go	1	89.05%
client/history/peer_resolver.go	1	89.89%
service/history/task/transfer_standby_task_executor.go	2	86.23%
common/cache/lru.go	2	93.01%
common/quotas/global/collection/internal/limiter.go	2	97.56%
common/task/fifo_task_scheduler.go	2	87.63%
service/frontend/api/handler.go	2	75.62%
common/membership/hashring.go	2	84.69%
service/history/handler/handler.go	3	95.65%
common/persistence/statsComputer.go	3	98.18%

Totals
Change from base Build 0190573d-ff12-4850-94f0-8c77deb099df:	0.06%
Covered Lines:	105261
Relevant Lines:	147236

coveralls · 2024-06-27T18:53:14Z

Pull Request Test Coverage Report for Build 01905aec-2f03-4488-81f1-7aff8cdf3c00

Details

688 of 852 (80.75%) changed or added relevant lines in 29 files are covered.
43 unchanged lines in 14 files lost coverage.
Overall coverage increased (+0.06%) to 71.49%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
common/dynamicconfig/config.go	11	13	84.62%
service/history/resource/resource.go	13	15	86.67%
common/quotas/global/rpc/error.go	1	4	25.0%
service/history/resource/resource_test_utils.go	0	3	0.0%
common/resource/resource_test_utils.go	0	5	0.0%
common/metrics/tags.go	3	9	33.33%
common/quotas/global/shared/keymapper.go	8	14	57.14%
client/history/peer_resolver.go	38	46	82.61%
common/quotas/global/algorithm/requestweighted.go	22	30	73.33%
common/types/mapper/thrift/any.go	17	25	68.0%

Files with Coverage Reduction	New Missed Lines	%
common/task/weighted_round_robin_task_scheduler.go	1	89.05%
client/history/peer_resolver.go	1	89.89%
service/history/task/transfer_standby_task_executor.go	2	86.23%
common/mapq/types/policy_collection.go	2	93.06%
common/cache/lru.go	2	93.01%
common/quotas/global/collection/internal/limiter.go	2	97.56%
service/frontend/api/handler.go	2	75.74%
common/persistence/historyManager.go	2	66.67%
service/history/handler/handler.go	3	95.65%
service/history/task/transfer_active_task_executor.go	3	71.09%

Totals
Change from base Build 0190573d-ff12-4850-94f0-8c77deb099df:	0.06%
Covered Lines:	105261
Relevant Lines:	147238

Groxx · 2024-06-27T20:36:58Z

@davidporter-id-au For users who're not particularly interested in this problem, who'll not attempt to roll out flipr config for the global rate-limit feature:

Are there any meaningful changes they should know about?

Directionally, can they just do nothing and it'll remain-as-is for them?

Added deployment steps near the top of the commit message. Look good?

) Mostly-prerequisite for the final major step of building the global ratelimiter system in cadence-workflow/cadence#6141

This Thrift addition _does not_ need to be done, the system could instead exchange Protobuf / gob / JSON data. But I've done it in Thrift because:

1. We already use Thrift rather heavily in service code, for long-term-stable data, like many database types.
2. We do not use Protobuf like ^ this _anywhere_. This PR could begin to change that, but I feel like that has some larger ramifications to discuss before leaping for it.
3. Gob is _significantly_ larger than Thrift, and no more human-readable than Thrift or Protobuf, and it doesn't offer quite as strong protection against unintended changes (IDL files/codegen make that "must be stable" contract very explicit).

Notes otherwise include:

- i32 because more than 2 billion operations (max i32 is 2,147,483,647) within an update cycle (~3s planned) on a single host is roughly 1,000x beyond the size of ALL of our current ratelimits, and it uses half of the space of an i64.
- To avoid roll-around issues even if this happens, the service code saturates at max-i32 rather than rolling around. We'll just lose precise weight information across beyond-2b hosts if that happens.
- `double` is returned because it's scale-agnostic and not particularly worth squeezing further, and it allows the aggregator to be completely target-RPS-agnostic (it doesn't need limiters or that dynamic config _at all_ as it just tracks weight).
- This could be adjusted to a... pair of ints? Local/global RPS used, so callers can determine their weight locally? I'm not sure if that'd be clearer or more useful, but it's an option, especially since I don't think we care about accurately tracking <1RPS (so ints are fine).
- If we decide we care a lot about data size, key strings are by far the majority of the bytes. There are a lot of key-compaction options (most simply: a map per collection name), we could experiment a bit.

And last but not least, if we change our mind and want to move away from Thrift here: we just need to make a new `any.ValueType` string to identify that new format, and maintain this thrift impl for as long as we want to allow transparent server upgrades. And when we remove it, if someone still hasn't upgraded yet, they'll just fall back to local-only behavior (which is what we have used for the past several years) until the deploy finishes. Risk is extremely low.

coveralls · 2024-06-27T21:28:55Z

Pull Request Test Coverage Report for Build 01905b7c-fcd3-46a3-9e31-6318be207dbc

Details

684 of 848 (80.66%) changed or added relevant lines in 29 files are covered.
21 unchanged lines in 10 files lost coverage.
Overall coverage increased (+0.07%) to 71.502%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
common/dynamicconfig/config.go	11	13	84.62%
service/history/resource/resource.go	13	15	86.67%
common/quotas/global/rpc/error.go	1	4	25.0%
service/history/resource/resource_test_utils.go	0	3	0.0%
common/resource/resource_test_utils.go	0	5	0.0%
common/metrics/tags.go	3	9	33.33%
common/quotas/global/shared/keymapper.go	8	14	57.14%
client/history/peer_resolver.go	38	46	82.61%
common/quotas/global/algorithm/requestweighted.go	22	30	73.33%
common/types/mapper/thrift/any.go	17	25	68.0%

Files with Coverage Reduction	New Missed Lines	%
common/task/weighted_round_robin_task_scheduler.go	1	89.05%
client/history/peer_resolver.go	1	89.89%
service/history/task/transfer_standby_task_executor.go	2	86.64%
common/cache/lru.go	2	93.01%
common/quotas/global/collection/internal/limiter.go	2	97.37%
common/task/fifo_task_scheduler.go	2	85.57%
service/frontend/api/handler.go	2	75.62%
service/history/task/transfer_active_task_executor.go	2	71.17%
common/persistence/statsComputer.go	3	98.18%
common/archiver/filestore/historyArchiver.go	4	80.95%

Totals
Change from base Build 0190573d-ff12-4850-94f0-8c77deb099df:	0.07%
Covered Lines:	105274
Relevant Lines:	147232

Groxx · 2024-06-27T22:07:52Z

common/dynamicconfig/config_test.go

-func (s *configSuite) SetupSuite() {
+func (s *configSuite) SetupTest() {


key changes were leaking between tests. doesn't seem like that's even remotely desired in these, so now each test gets a new value.

coveralls · 2024-06-27T22:44:15Z

Pull Request Test Coverage Report for Build 01905bc1-d3c5-43ef-b8b1-17df18d6a6da

Details

686 of 848 (80.9%) changed or added relevant lines in 29 files are covered.
19 unchanged lines in 7 files lost coverage.
Overall coverage increased (+0.09%) to 71.52%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
service/history/resource/resource.go	13	15	86.67%
common/quotas/global/rpc/error.go	1	4	25.0%
service/history/resource/resource_test_utils.go	0	3	0.0%
common/resource/resource_test_utils.go	0	5	0.0%
common/metrics/tags.go	3	9	33.33%
common/quotas/global/shared/keymapper.go	8	14	57.14%
client/history/peer_resolver.go	38	46	82.61%
common/quotas/global/algorithm/requestweighted.go	22	30	73.33%
common/types/mapper/thrift/any.go	17	25	68.0%
service/history/handler/handler.go	20	30	66.67%

Files with Coverage Reduction	New Missed Lines	%
common/task/weighted_round_robin_task_scheduler.go	1	89.05%
client/history/peer_resolver.go	1	89.89%
common/cache/lru.go	2	93.01%
service/matching/tasklist/task_list_manager.go	2	76.65%
common/quotas/global/collection/internal/limiter.go	2	97.37%
common/task/fifo_task_scheduler.go	2	85.57%
service/history/shard/context.go	9	78.13%

Totals
Change from base Build 0190573d-ff12-4850-94f0-8c77deb099df:	0.09%
Covered Lines:	105301
Relevant Lines:	147232

taylanisikdemir · 2024-06-28T21:24:26Z

common/dynamicconfig/config_test.go

+			// unknown filter string is likely safe to change and then should be updated here, but otherwise this ensures the logic isn't entirely position-dependent.
+			require.Equalf(t, "unknownFilter", filterString, "expected first filter to be 'unknownFilter', but it was %v", filterString)
+		} else {
+			assert.NotEqualf(t, UnknownFilter, ParseFilter(filterString), "failed to parse filter: %s, make sure it is in ParseFilter's switch statement", filterString)


this is a weak validation that only checks it's not parsed as UnknownFilter but better than nothing

a test that checks that the mapping is what we expect would just be a re-implementation of the func itself :\ though I can check that it's unique, since nothing enforces that currently.

I think this would all probably be better done with a map rather than a slice, so the pairing can be built up in a single hardcoded location, I just kinda don't want to make major changes here in this PR.

added a uniqueness check too

that's fair. without changing filters into a map I don't see a future proof way. it can be handled separately since this PR is quite big already
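
For reference, the kind of uniqueness check discussed in this thread can be sketched as below. This is not the test added in the PR: `duplicates` is a hypothetical helper, and of the filter names in `main` only "unknownFilter" and "ratelimitKey" appear in this PR, the third is made up.

```go
package main

import "fmt"

// duplicates returns every value that appears more than once in names, in
// the order each repeat is first detected, so a test can fail with a useful
// message instead of a bare "not unique".
func duplicates(names []string) []string {
	seen := make(map[string]int, len(names))
	var dups []string
	for _, n := range names {
		seen[n]++
		if seen[n] == 2 { // report each duplicated name only once
			dups = append(dups, n)
		}
	}
	return dups
}

func main() {
	// "exampleFilter" is invented for this sketch.
	filters := []string{"unknownFilter", "ratelimitKey", "exampleFilter", "ratelimitKey"}
	if dups := duplicates(filters); len(dups) > 0 {
		fmt.Println("duplicate filter strings:", dups)
	}
}
```

Building the string-to-filter pairing as a map in one hardcoded location, as suggested in the thread, would make this class of mistake much harder to introduce and leave the check as a backstop.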

coveralls · 2024-06-28T22:09:52Z

Pull Request Test Coverage Report for Build 019060c6-2bde-406c-bf4d-55ee12eb19fe

Details

688 of 850 (80.94%) changed or added relevant lines in 29 files are covered.
26 unchanged lines in 9 files lost coverage.
Overall coverage increased (+0.06%) to 71.49%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
service/history/resource/resource.go	13	15	86.67%
common/quotas/global/rpc/error.go	1	4	25.0%
service/history/resource/resource_test_utils.go	0	3	0.0%
common/resource/resource_test_utils.go	0	5	0.0%
common/metrics/tags.go	3	9	33.33%
common/quotas/global/shared/keymapper.go	8	14	57.14%
client/history/peer_resolver.go	38	46	82.61%
common/quotas/global/algorithm/requestweighted.go	22	30	73.33%
common/types/mapper/thrift/any.go	17	25	68.0%
service/history/handler/handler.go	20	30	66.67%

Files with Coverage Reduction	New Missed Lines	%
common/task/weighted_round_robin_task_scheduler.go	1	89.05%
client/history/peer_resolver.go	1	89.89%
service/history/task/transfer_standby_task_executor.go	2	86.84%
common/cache/lru.go	2	93.01%
common/quotas/global/collection/internal/limiter.go	2	97.37%
service/matching/tasklist/task_list_manager.go	3	76.45%
common/task/fifo_task_scheduler.go	3	84.54%
common/persistence/statsComputer.go	3	98.18%
service/history/shard/context.go	9	78.13%

Totals
Change from base Build 0190573d-ff12-4850-94f0-8c77deb099df:	0.06%
Covered Lines:	105257
Relevant Lines:	147234

Groxx added 11 commits June 10, 2024 17:55

dynamic config set up, and metrics, and now need more logic to handle fallbacks

573ea38

got a panic button disable going

8100d32

Merge remote-tracking branch 'origin/master' into limiter_interface

794f664

most of shadowing is probably working

4f1cccc

partial copies from rpc branch

fae6457

metrics, most rpc added, no tests yet

b6df0e5

minor cleanup/fix

9eb9ca8

minor

1d7afcf

it builds

84b5167

basic test of collection is passing, boilerplate handler built but not yet tested

e7750ad

tests pass, most things finished, maybe try running

f472646

fix metric scope location

2a3b361

Groxx added 3 commits June 20, 2024 18:51

minor fixes and behavior improvements

2b1b78e

arguable: add a "run 2 instances" helper config

816f3e5

use will-be-default ratelimiter mode by default

9361aa1

Groxx added 3 commits June 21, 2024 20:19

more tests for collection-related stuff, pretty good coverage

8c0d9c1

lint fix

9e892fd

more linting blah

62c4c9b

Groxx changed the title ~~testing in ci, ignore~~ Global ratelimiter: everything else Jun 24, 2024

Groxx marked this pull request as ready for review June 24, 2024 22:12

Groxx requested review from Shaddoll, neil-xie, davidporter-id-au, shijiesheng, agautam478 and jakobht as code owners June 24, 2024 22:12

davidporter-id-au reviewed Jun 27, 2024

common/quotas/global/rpc/client.go (outdated, resolved)

davidporter-id-au reviewed Jun 27, 2024

common/quotas/global/rpc/client.go (outdated, resolved)

davidporter-id-au reviewed Jun 27, 2024

service/history/handler/handler.go (outdated, resolved)

davidporter-id-au reviewed Jun 27, 2024

Groxx added 4 commits June 26, 2024 22:46

Feedback from a first review pass

5c58beb

more small touchups, test fix

7403ba9

test fix

135fdc7

Merge remote-tracking branch 'origin/master' into limiter_interface

789edf6

disable -> disabled consistency

943b935

taylanisikdemir approved these changes Jun 27, 2024

Switch to merged IDL change, minor cleanup

7618574

Groxx added 2 commits June 27, 2024 17:37

dynamic config key description improvements, minor var name consistency

f631296

coverage for new dynamic config types

0942c4c

Groxx commented Jun 27, 2024

fixing dynamic config key docs

02cff69

add missing dynamic config filter parsing, and test to catch it next time

1f37531

taylanisikdemir reviewed Jun 28, 2024

also make sure no dups exist

0d7d4cb

Groxx enabled auto-merge (squash) June 28, 2024 21:45

Groxx merged commit 239767d into cadence-workflow:master Jun 28, 2024
18 of 19 checks passed
