overload manager: overload signals based on number of downstream connections and active requests #12419
cc: @nezdolik
@nezdolik once you accept the org invite we can assign the issue. Thank you for working on this!
Thanks @mattklein123 @antoniovicente, I believe issue assignment should work now.
Do we have any established patterns for aggregating counts of objects across multiple workers efficiently? We probably have stats for most of the metrics we'd want to use, but I don't know if the flush frequency is high enough for the overload manager to make use of them.
It may make sense to keep an atomic counter with the aggregate number of active client connections across all listeners and clusters, and to think of that number as separate from stats.
What about having aggregate counts in the individual worker threads via a TLS object, then having the resource monitor aggregate those using an atomic counter when the overload manager initiates a poll?
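A minimal sketch of what that could look like, assuming hypothetical names (`WorkerConnCounter`, `ConnectionCountAggregator`); this is illustrative only and not Envoy's actual implementation. Each worker bumps its own counter on its dispatcher thread, and the aggregator folds those into a single atomic total when the overload manager polls.

```cpp
// Illustrative sketch only; WorkerConnCounter and ConnectionCountAggregator
// are hypothetical names, not Envoy interfaces.
#include <atomic>
#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

struct WorkerConnCounter {
  // Written by the owning worker thread, read by the aggregating poll.
  std::atomic<int64_t> active_connections{0};
};

class ConnectionCountAggregator {
public:
  // Called once per worker at startup; the worker keeps the returned reference
  // in thread-local storage and updates it without cross-thread coordination.
  WorkerConnCounter& registerWorker() {
    std::lock_guard<std::mutex> lock(mutex_);
    workers_.push_back(std::make_unique<WorkerConnCounter>());
    return *workers_.back();
  }

  // Called when the overload manager initiates a poll: sum the per-worker
  // counters into a single atomic total that the resource monitor can report.
  int64_t aggregate() {
    std::lock_guard<std::mutex> lock(mutex_);
    int64_t sum = 0;
    for (const auto& worker : workers_) {
      sum += worker->active_connections.load(std::memory_order_relaxed);
    }
    total_.store(sum, std::memory_order_relaxed);
    return sum;
  }

  int64_t total() const { return total_.load(std::memory_order_relaxed); }

private:
  std::mutex mutex_;
  std::vector<std::unique_ptr<WorkerConnCounter>> workers_;
  std::atomic<int64_t> total_{0};
};
```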
Thanks @nezdolik! I will take a look next week. OOO for a bit.
I think this task needs to be split into 3 parts:
Splitting sounds fine. I'm curious about the details behind each of the 3 parts you describe above.
I wonder if the listener can get the thread-local overload state from thread-local storage in some way. If that approach is not possible, we may want to consider plumbing the OM into dispatcher creation, similar to the plumbing done in #14679 that added extra arguments to the factory method that creates dispatchers in worker threads.
@antoniovicente the listener would get the thread-local overload state, but it should also be able to trigger a reactive check in the OM (which would recalculate the overload action state on demand and post it to all interested callbacks on worker threads), so the interaction is bidirectional.
The listener would register for the action in the overload manager. This is how I envision reactive checks in the overload manager, but maybe there are better ways. cc @akonradi
The approach proposed above requires an OM reference to be accessible in the TCP listener.
If this is the only way that workers can learn that they shouldn't accept new downstream connections (and if there's another way, I missed it), that's not going to be fast enough. If we want to absolutely guarantee anything, we need a counter that is incremented and read atomically; otherwise there's still the possibility of a race where, for example, two workers each see room under the cap and both accept.

Having a thread-local cached copy of these atomic counters sounds like a good way to get useful information in cases where we don't need perfect consistency, like if we wanted to reduce timeouts as the number of connections increases. But if there's a hard cap we want to honor, we need to pick some point along the line between "single shared atomic counter" and "allocate each worker its own quota" (with a midpoint being "workers pull from the shared counter into their local bucket, then allocate from there"); see the sketch below.

nit: updateResource should take something other than a string, or should be called on some kind of handle object. String comparison (including as part of lookup or case matching) is unnecessarily expensive, and we should avoid it as much as possible on the data path, including when establishing connections.
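To make the two ends of that spectrum concrete, here is a rough sketch; `GlobalConnGate` and `LocalQuotaCache` are hypothetical names and the code only illustrates the technique, not Envoy's implementation.

```cpp
// Illustrative sketch only; GlobalConnGate and LocalQuotaCache are
// hypothetical names, not Envoy code.
#include <atomic>
#include <cstdint>

// "Single shared atomic counter": every accept runs one CAS loop, so the cap
// is never exceeded, at the cost of cross-thread contention on one counter.
class GlobalConnGate {
public:
  explicit GlobalConnGate(int64_t max) : max_(max) {}

  bool tryAcquire() {
    int64_t current = count_.load(std::memory_order_relaxed);
    while (current < max_) {
      // On failure `current` is reloaded with the latest value and we retry.
      if (count_.compare_exchange_weak(current, current + 1,
                                       std::memory_order_acq_rel)) {
        return true; // Slot won atomically; no check-then-act race.
      }
    }
    return false; // At or above the hard cap.
  }

  void release() { count_.fetch_sub(1, std::memory_order_acq_rel); }

private:
  const int64_t max_;
  std::atomic<int64_t> count_{0};
};

// The midpoint: a worker pulls a batch of slots from the shared gate into its
// own bucket and allocates from there, trading exactness near the cap for far
// fewer atomic operations on the accept path. (Returning unused slots on
// worker shutdown is omitted for brevity.)
class LocalQuotaCache {
public:
  LocalQuotaCache(GlobalConnGate& gate, int batch) : gate_(gate), batch_(batch) {}

  bool tryAcquire() {
    if (local_ == 0 && !refill()) {
      return false;
    }
    --local_;
    return true;
  }

private:
  bool refill() {
    for (int i = 0; i < batch_; ++i) {
      if (!gate_.tryAcquire()) {
        break;
      }
      ++local_;
    }
    return local_ > 0;
  }

  GlobalConnGate& gate_;
  const int batch_;
  int local_{0}; // Owned by one worker thread; no synchronization needed.
};
```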
I think the name "reactive" threw me off. The check you're proposing is a "proactive" check, rather than a reactive mechanism like the current one, which operates on the most recently propagated overload state information computed periodically.

Regarding the interface: ideally you'd query the overload manager for the proactive resource tracker during startup, as part of the process of creating listeners, and keep a reference to it in the listener so that each update to the resource tracker does not require a map lookup by name.
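A rough sketch of the handle-based idea, with hypothetical names (`ProactiveResourceTracker`, `getProactiveResource`); the point is only that the name-to-tracker lookup happens once at listener creation, so the per-connection path touches just the handle.

```cpp
// Illustrative sketch only; ProactiveResourceTracker and getProactiveResource
// are hypothetical names, not the actual overload manager API.
#include <memory>
#include <string>
#include <unordered_map>

class ProactiveResourceTracker {
public:
  virtual ~ProactiveResourceTracker() = default;
  virtual bool tryIncrement() = 0; // Refuses once the configured limit is hit.
  virtual void decrement() = 0;
};

class OverloadManagerSketch {
public:
  // Looked up once while the listener is created at startup; the listener
  // stores the returned pointer so the per-connection path never does a map
  // lookup or a string comparison.
  ProactiveResourceTracker* getProactiveResource(const std::string& name) {
    const auto it = trackers_.find(name);
    return it == trackers_.end() ? nullptr : it->second.get();
  }

private:
  std::unordered_map<std::string, std::unique_ptr<ProactiveResourceTracker>> trackers_;
};
```

The listener would then call tryIncrement() on the stored handle for every accepted connection and decrement() when the connection closes.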
@akonradi I may not have provided enough detail on the suggested approach. Worker threads will not be aware of the global max and global current counters; they only maintain local counters (e.g. per listener) and thread-local state for the overload action. The atomic global counter is stored in the resource monitor in the OM on the main thread. Upon trying to accept a new connection, the worker thread contacts the OM with an "increment"; the OM in turn propagates that increment to the resource monitor, which atomically increments the global counter. The resource monitor triggers recalculation of the overload actions tied to that resource, and the OM then propagates the new state of the overload action to the worker threads.
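A compressed sketch of that flow, again with made-up names (`DownstreamConnectionsMonitor`, `OverloadManagerFlowSketch`): the monitor owns the atomic global counter, and each increment triggers a recalculation whose result is posted to the registered worker callbacks.

```cpp
// Illustrative sketch only; DownstreamConnectionsMonitor and
// OverloadManagerFlowSketch are hypothetical names, not Envoy code.
#include <atomic>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

enum class OverloadActionState { Inactive, Saturated };

// Lives in the resource monitor on the main thread; owns the global counter.
class DownstreamConnectionsMonitor {
public:
  explicit DownstreamConnectionsMonitor(int64_t max) : max_(max) {}
  void increment() { current_.fetch_add(1, std::memory_order_relaxed); }
  void decrement() { current_.fetch_sub(1, std::memory_order_relaxed); }
  bool saturated() const {
    return current_.load(std::memory_order_relaxed) >= max_;
  }

private:
  const int64_t max_;
  std::atomic<int64_t> current_{0};
};

class OverloadManagerFlowSketch {
public:
  // Stand-in for posting a closure to a worker's dispatcher thread.
  using WorkerPost = std::function<void(OverloadActionState)>;

  void registerWorker(WorkerPost post) { workers_.push_back(std::move(post)); }

  // Entry point for the "increment" a worker sends when it tries to accept a
  // connection: bump the global counter, recalculate the action state tied to
  // this resource, and propagate the new state to every registered worker.
  void onWorkerIncrement(DownstreamConnectionsMonitor& monitor) {
    monitor.increment();
    const OverloadActionState state = monitor.saturated()
                                          ? OverloadActionState::Saturated
                                          : OverloadActionState::Inactive;
    for (const auto& post : workers_) {
      post(state);
    }
  }

private:
  std::vector<WorkerPost> workers_;
};
```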
@KBaichoo this issue is not done yet. We are 60-70% done with one of the suggested monitors; we still need to plug the downstream connections monitor into the TCP listener and deprecate the existing mechanism that tracks downstream connections. The remaining monitors (active requests and upstream connections) are 0% done.
Whoops sorry, this autoclosed when merging the PR. Reopen, thanks! |
Deprecate runtime key `overload.global_downstream_max_connections` and track the global active downstream connections limit in the overload manager instead. The runtime key is still usable, but using it yields a deprecation warning. If both mechanisms are configured (overload resource monitor and runtime key), the overload manager config is preferred.
Commit Message:
Additional Description:
Risk Level: High (change affects the request hot path)
Testing: Done
Docs Changes: Done
Release Notes: TBD
Platform Specific Features: NA
Fixes #12419
Deprecated: Added deprecation note for runtime key `overload.global_downstream_max_connections`
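A small sketch of the precedence rule described above, assuming hypothetical names (`resolveLimit`, `EffectiveDownstreamConnLimit`); the actual implementation lives in Envoy and may differ.

```cpp
// Illustrative sketch only; resolveLimit and EffectiveDownstreamConnLimit are
// hypothetical names, not the actual Envoy implementation.
#include <cstdint>
#include <iostream>
#include <limits>
#include <optional>

struct EffectiveDownstreamConnLimit {
  int64_t value;
  bool from_overload_manager;
};

// The overload manager resource monitor wins when both mechanisms are
// configured; the legacy runtime key still works but logs a deprecation warning.
EffectiveDownstreamConnLimit
resolveLimit(std::optional<int64_t> overload_manager_limit,
             std::optional<int64_t> runtime_key_limit) {
  if (overload_manager_limit.has_value()) {
    return {*overload_manager_limit, true};
  }
  if (runtime_key_limit.has_value()) {
    std::cerr << "runtime key overload.global_downstream_max_connections is "
                 "deprecated; configure the overload manager resource monitor "
                 "instead\n";
    return {*runtime_key_limit, false};
  }
  return {std::numeric_limits<int64_t>::max(), false}; // No limit configured.
}
```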
@kyessenov could you please reopen this issue? This is an umbrella ticket for multiple tasks in the overload manager.
@KBaichoo @botengyao would it be beneficial to introduce resource monitor (3) from the original suggestion to track and limit the number of sockets used for the upstreams? It sounds useful to me.
If you have a strong use case it could make sense, but I'm a bit more skeptical of the value of limiting the number of upstream connections globally, given it is not entirely controlled by an attacker, e.g. we'd have some connection reuse across various streams.
It would be helpful to add resource monitors that track the number of sockets used for (1) downstream connections, (2) active requests, and (3) possibly upstream connections, in order to provide better protection against resource attacks on the configured fd rlimits. Tracking these counts proxy-wide should be sufficient.

The motivations behind this enhancement include a desire for more consistent configuration of resource limits and of the actions we can take when approaching overload, the recently introduced parameters that limit the max number of client connections (globally or per listener) to protect against fd rlimit attacks, and the introduction of more graceful options for handling increases in resource usage, such as the adaptive HTTP request timeouts added in #11427.
Possible future enhancements: