-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Overload] Active downstream connections resource monitor #19186
[Overload] Active downstream connections resource monitor #19186
Conversation
Signed-off-by: Kateryna Nezdolii <[email protected]>
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
Working on tests |
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
My suggestion is not to add public docs on this monitor in this PR (until monitor is not wired up with connection tracking code within TcpListener in the next PR) |
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
/lgtm api |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! Here's a first pass
@@ -0,0 +1,22 @@ | |||
syntax = "proto3"; | |||
|
|||
package envoy.extensions.resource_monitors.downstream_connections.v3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be an v3alpha if this is an alpha extension? seems like it is from https://github.com/envoyproxy/envoy/pull/19186/files#diff-5fb7b50b079c821a11f230bec1d1ab9b05256541f14f8bc5045941d301759801
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like is not possible to specify any other version for extension than v3
according to api style guide for extensions. I've marked entire proto file for this extension as work in progress and hid it from the docs (until the third PR lands in that would make use of this new monitor in tcp listener)
int64_t maxResourceUsage() const override; | ||
|
||
protected: | ||
uint64_t max_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on perhaps changing this, and where it's set in the proto to int64_t instead? The pgv rule can enforce this value is > 0 (if that was the concern) Currently, there are values this could be set to where the static cast is unsafe
envoy::extensions::resource_monitors::downstream_connections::v3::DownstreamConnectionsConfig | ||
config; | ||
std::unique_ptr<ActiveDownstreamConnectionsResourceMonitor> monitor( | ||
new ActiveDownstreamConnectionsResourceMonitor(config)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prefer make_unique (here and elsewhere)
DownstreamConnectionsConfig& config) | ||
: max_(config.max_active_downstream_connections()), current_(0){}; | ||
|
||
bool ActiveDownstreamConnectionsResourceMonitor::tryAllocateResource(int64_t increment) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In real use cases can tryAllocateResources and tryDeallocateResources be called concurrently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, they can be called for example from multiple worker threads
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do think using RAII for the resource allocated / decremented is probably the way to go to stop incorrect from being possible and leaking resources.
For example if max = 15, and originally we have current=10
with some bad calls and threads swapping I think we can up with strange results:
- Call to tryAlloc(10) that stops at line 20 (doesn't yet read current for the decrement)
- Call to deAlloc(20) => current is still 20, so we decrement and store 0 into current
- The first call reads current (now 0) and decrements 10, giving us current or -10 :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Realised that RAII/Memento approach will not work with cross thread visibility. Various threads will be accessing tryAlloc/deAlloc resource via their thread local overload state object. The point of using atomic counter was to bypass periodic slow flushes/updates in OM (where thread local overload state is periodically updated for all threads) and instead perform faster checks from any thread relying on atomic guarantees for cross thread visibility of latest counter value.
@KBaichoo wdyt?
/wait |
/assign @KBaichoo |
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
need to increase coverage |
/wait |
Signed-off-by: Kateryna Nezdolii <[email protected]>
|
The lines missing for coverage: Seems like they might be hard to reliably cover in a test since you'd need two concurrent calls to An alternative that would avoid that issue is using an opaque object, sort of like a momento https://en.wikipedia.org/wiki/Memento_pattern e.g.
Then have your factory have a call to alloc this object, and then take it with increment and decrement to increase that capacity object. Because the capacity belongs to a given resource store, we shouldn't have issue of needing to revert a decrement b/c it goes negative. |
auto factory = | ||
Registry::FactoryRegistry<Server::Configuration::ProactiveResourceMonitorFactory>::getFactory( | ||
"envoy.resource_monitors.downstream_connections"); | ||
EXPECT_NE(factory, nullptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like it should be an ASSERT as we shouldn't be continuing the test if this fails. Here and below
|
||
envoy::extensions::resource_monitors::downstream_connections::v3::DownstreamConnectionsConfig | ||
config; | ||
config.set_max_active_downstream_connections(std::numeric_limits<uint64_t>::max()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps using -1 (or some other more direct invalid value), could be more clear than relying on uint64_t::max() bits to be interpreted as an int64_t value of -1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree
// [#not-implemented-hide:] | ||
message DownstreamConnectionsConfig { | ||
// Maximum threshold for global open active downstream connections, defaults to 0. | ||
// If monitor is configured via Overload manager api and has no value set, Envoy will reject all incoming connections. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically with the gt annotation that you added Envoy will reject the configuration as your test https://github.com/envoyproxy/envoy/pull/19186/files#diff-1fdfe762c6263d74ffc5365cd93a44ff14c64adac898bbf1e2de53a3a91e000bR61 shows.
Is this expected to be >= 0 or > 0?
"If monitor is configured via Overload manager api and has no value set, Envoy will reject all incoming connections."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is up for discussion. Having it '>0' would require users to explicitly configure threshold for monitor (and fail/error on startup otherwise if threshold is not configured). While '>=0' can be more tricky for users, if they forget to configure the threshold, monitor will use default value 0 and reject all incoming connections. First option (require explicit value > 0) is cleaner in my opinion, although monitor will not be able to reject all incoming connections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm
/wait |
Keepalive comment, am currently on parental leave with limited capacity but will try to get this finished. |
Signed-off-by: Kateryna Nezdolii <[email protected]>
Please bear with my slowness. Applied review comments/nits, will take care of missing coverage for concurrent code block later this week. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making forward progress, on this even while your on leave. Also congratulations 🎉 .
This is nearly there.
// [#not-implemented-hide:] | ||
message DownstreamConnectionsConfig { | ||
// Maximum threshold for global open active downstream connections, defaults to 0. | ||
// If monitor is configured via Overload manager api and has no value set, Envoy will reject all incoming connections. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm
DownstreamConnectionsConfig& config) | ||
: max_(config.max_active_downstream_connections()), current_(0){}; | ||
|
||
bool ActiveDownstreamConnectionsResourceMonitor::tryAllocateResource(int64_t increment) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do think using RAII for the resource allocated / decremented is probably the way to go to stop incorrect from being possible and leaking resources.
For example if max = 15, and originally we have current=10
with some bad calls and threads swapping I think we can up with strange results:
- Call to tryAlloc(10) that stops at line 20 (doesn't yet read current for the decrement)
- Call to deAlloc(20) => current is still 20, so we decrement and store 0 into current
- The first call reads current (now 0) and decrements 10, giving us current or -10 :(
/wait |
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
Signed-off-by: Kateryna Nezdolii [email protected]
Active downstream connections resource monitor based on new proactive checks in overload manager framework. Continuation of work started in this PR. After this PR is merged, we could replace existing global&per-listener connections tracking mechanism in TCP Listener and plug overload manager downstream connections resource monitor instead.
-->
Commit Message:
Additional Description:
Risk Level: Low (new extension not wired up with existing code)
Testing: Done
Docs Changes: TBD
Release Notes:
Platform Specific Features: NA
Fixes #12419