
[Overload] Active downstream connections resource monitor #19186

Closed

Conversation

nezdolik
Member

@nezdolik nezdolik commented Dec 3, 2021

Signed-off-by: Kateryna Nezdolii [email protected]

Active downstream connections resource monitor based on the new proactive checks in the overload manager framework. Continuation of work started in this PR. Once this PR is merged, we can replace the existing global and per-listener connection tracking mechanism in the TCP listener and plug in the overload manager's downstream connections resource monitor instead.

Commit Message:
Additional Description:
Risk Level: Low (new extension not wired up with existing code)
Testing: Done
Docs Changes: TBD
Release Notes:
Platform Specific Features: NA
Fixes #12419

@repokitteh-read-only

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #19186 was opened by nezdolik.


@nezdolik
Member Author

nezdolik commented Dec 3, 2021

Working on tests

@nezdolik nezdolik marked this pull request as draft December 3, 2021 15:47
Kateryna Nezdolii added 2 commits December 7, 2021 23:10
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
@nezdolik nezdolik marked this pull request as ready for review December 7, 2021 23:20
@nezdolik
Member Author

nezdolik commented Dec 7, 2021

My suggestion is not to add public docs for this monitor in this PR (until the monitor is wired up with the connection tracking code within TcpListener in the next PR).

Kateryna Nezdolii added 2 commits December 7, 2021 23:28
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
@nezdolik
Member Author

nezdolik commented Dec 8, 2021

@markdroth
Contributor

/lgtm api

@repokitteh-read-only repokitteh-read-only bot removed the api label Dec 9, 2021
Contributor

@KBaichoo KBaichoo left a comment


Thanks for working on this! Here's a first pass

@@ -0,0 +1,22 @@
syntax = "proto3";

package envoy.extensions.resource_monitors.downstream_connections.v3;
Member Author

Looks like it is not possible to specify any version other than v3 for an extension, according to the API style guide for extensions. I've marked the entire proto file for this extension as work in progress and hid it from the docs (until the third PR lands that makes use of this new monitor in the TCP listener).

int64_t maxResourceUsage() const override;

protected:
uint64_t max_;
Contributor

Thoughts on perhaps changing this, and where it's set in the proto, to int64_t instead? The pgv rule can enforce that this value is > 0 (if that was the concern). Currently, there are values this could be set to where the static cast is unsafe.
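
A small standalone illustration of the hazard (my example, not code from the PR); on the usual two's-complement platforms a uint64_t above int64_t's maximum wraps to a negative value when cast:

#include <cstdint>
#include <limits>

// uint64_t::max() has all 64 bits set; reinterpreted as a signed 64-bit
// value it becomes -1, so the "maximum" silently turns into an invalid limit.
uint64_t configured = std::numeric_limits<uint64_t>::max();
int64_t max_connections = static_cast<int64_t>(configured); // == -1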

envoy::extensions::resource_monitors::downstream_connections::v3::DownstreamConnectionsConfig
config;
std::unique_ptr<ActiveDownstreamConnectionsResourceMonitor> monitor(
new ActiveDownstreamConnectionsResourceMonitor(config));
Contributor

prefer make_unique (here and elsewhere)
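
A minimal sketch of the suggested change, using the types from the quoted snippet:

// std::make_unique avoids the explicit `new` and keeps the allocation
// exception-safe.
auto monitor =
    std::make_unique<ActiveDownstreamConnectionsResourceMonitor>(config);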

DownstreamConnectionsConfig& config)
: max_(config.max_active_downstream_connections()), current_(0){};

bool ActiveDownstreamConnectionsResourceMonitor::tryAllocateResource(int64_t increment) {
Contributor

In real use cases can tryAllocateResources and tryDeallocateResources be called concurrently?

Member Author

Yes, they can be called, for example, from multiple worker threads.

Contributor

I do think using RAII for the resource allocation/decrement is probably the way to go, to stop incorrect usage from being possible and leaking resources.

For example, if max = 15 and originally we have current = 10, with some bad calls and threads swapping I think we can end up with strange results:

  1. A call to tryAlloc(10) that stops at line 20 (doesn't yet read current for the decrement)
  2. A call to deAlloc(20) => current is still 20, so we decrement and store 0 into current
  3. The first call reads current (now 0) and decrements 10, giving us current of -10 :(

Member Author

@nezdolik nezdolik Feb 18, 2022

Realised that the RAII/Memento approach will not work with cross-thread visibility. Various threads will be accessing the tryAlloc/deAlloc resource via their thread-local overload state object. The point of using an atomic counter was to bypass the periodic slow flushes/updates in OM (where the thread-local overload state is periodically updated for all threads) and instead perform faster checks from any thread, relying on atomic guarantees for cross-thread visibility of the latest counter value.
@KBaichoo wdyt?
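
For context, a minimal sketch of the atomic compare-exchange pattern the author describes; the class and method names here are illustrative, not the PR's actual code:

#include <atomic>
#include <cstdint>

class ActiveConnectionsCounter {
public:
  explicit ActiveConnectionsCounter(int64_t max) : max_(max), current_(0) {}

  // Lock-free check-and-increment: every worker thread sees the latest
  // counter value through the atomic, with no waiting for the overload
  // manager's periodic thread-local state flushes.
  bool tryAllocate(int64_t increment) {
    int64_t current = current_.load();
    do {
      if (current + increment > max_) {
        return false; // Over the limit; leave the counter untouched.
      }
      // On CAS failure `current` is reloaded, so the bound is re-checked
      // against the newest value before retrying.
    } while (!current_.compare_exchange_weak(current, current + increment));
    return true;
  }

  bool tryDeallocate(int64_t decrement) {
    int64_t current = current_.load();
    do {
      if (current - decrement < 0) {
        return false; // Would go negative, as in the race described earlier.
      }
    } while (!current_.compare_exchange_weak(current, current - decrement));
    return true;
  }

private:
  const int64_t max_;
  std::atomic<int64_t> current_;
};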

@phlax
Member

phlax commented Dec 15, 2021

/wait

@KBaichoo
Contributor

/assign @KBaichoo

Signed-off-by: Kateryna Nezdolii <[email protected]>
Kateryna Nezdolii added 3 commits December 29, 2021 09:25
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
@nezdolik
Member Author

need to increase coverage

@jmarantz
Contributor

jmarantz commented Jan 4, 2022

/wait

Signed-off-by: Kateryna Nezdolii <[email protected]>
@nezdolik
Member Author

Code coverage for source/extensions/resource_monitors/downstream_connections is lower than the limit of 96.6 (88.9).
Partially fixed coverage; still need to increase it to at least 96.6.

@KBaichoo
Contributor

The lines missing for coverage:

https://storage.googleapis.com/envoy-pr/6101a36/coverage/source/extensions/resource_monitors/downstream_connections/downstream_connections_monitor.cc.gcov.html

Seems like they might be hard to reliably cover in a test, since you'd need two concurrent calls to tryDeallocateResource(int64_t decrement) that would pass the release assert, but one of which would fail to decrement.

An alternative that would avoid that issue is using an opaque object, sort of like a memento: https://en.wikipedia.org/wiki/Memento_pattern

e.g.

class ResourceStore {
public:
  ResourceStore() : capacity_(0) {}

private:
  // Allows the monitor to query this value, but keeps it opaque to others.
  friend class ActiveDownstreamConnectionsResourceMonitor;
  int64_t capacity_;
};

Then have your factory include a call to allocate this object, and then take it with increment and decrement to adjust that capacity. Because the capacity belongs to a given resource store, we shouldn't have the issue of needing to revert a decrement because it goes negative. A hypothetical usage sketch follows.
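
A hypothetical usage sketch of this suggestion; allocateStore, increment, and decrement are illustrative names, not APIs from the PR:

// Each caller owns its ResourceStore, so a deallocation can only release
// capacity that this caller actually acquired; a stray call can never push
// another caller's count negative.
auto store = monitor.allocateStore(); // factory hands out an opaque store
monitor.increment(*store, 1);         // connection accepted
monitor.decrement(*store, 1);         // connection closed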

auto factory =
Registry::FactoryRegistry<Server::Configuration::ProactiveResourceMonitorFactory>::getFactory(
"envoy.resource_monitors.downstream_connections");
EXPECT_NE(factory, nullptr);
Contributor

This seems like it should be an ASSERT, as we shouldn't be continuing the test if this fails. Here and below.
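
A one-line illustration, using standard gtest semantics:

// ASSERT_NE aborts the current test function on failure, so the test never
// goes on to dereference a null factory; EXPECT_NE records the failure and
// keeps running, typically crashing on the next use.
ASSERT_NE(factory, nullptr);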


envoy::extensions::resource_monitors::downstream_connections::v3::DownstreamConnectionsConfig
config;
config.set_max_active_downstream_connections(std::numeric_limits<uint64_t>::max());
Contributor

Perhaps using -1 (or some other more direct invalid value) could be clearer than relying on uint64_t::max() bits being interpreted as an int64_t value of -1.

Member Author

agree

// [#not-implemented-hide:]
message DownstreamConnectionsConfig {
// Maximum threshold for global open active downstream connections, defaults to 0.
// If monitor is configured via Overload manager api and has no value set, Envoy will reject all incoming connections.
Contributor

Technically, with the gt annotation that you added, Envoy will reject the configuration, as your test https://github.com/envoyproxy/envoy/pull/19186/files#diff-1fdfe762c6263d74ffc5365cd93a44ff14c64adac898bbf1e2de53a3a91e000bR61 shows.

Is this expected to be >= 0 or > 0?

"If monitor is configured via Overload manager api and has no value set, Envoy will reject all incoming connections."

Member Author

I think this is up for discussion. Having it '> 0' would require users to explicitly configure a threshold for the monitor (and fail with an error on startup otherwise, if the threshold is not configured). '>= 0' can be more tricky for users: if they forget to configure the threshold, the monitor will use the default value of 0 and reject all incoming connections. The first option (require an explicit value > 0) is cleaner in my opinion, although the monitor will then not be able to reject all incoming connections.

Contributor

sgtm

@KBaichoo
Contributor

/wait

@nezdolik
Member Author

Keepalive comment: am currently on parental leave with limited capacity, but will try to get this finished.

Signed-off-by: Kateryna Nezdolii <[email protected]>
@nezdolik
Member Author

nezdolik commented Feb 9, 2022

Please bear with my slowness. Applied review comments/nits; will take care of missing coverage for the concurrent code block later this week.

Contributor

@KBaichoo KBaichoo left a comment


Thanks for making forward progress on this even while you're on leave. Also, congratulations 🎉.

This is nearly there.


@KBaichoo
Contributor

/wait

@github-actions

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Mar 20, 2022
@github-actions
Copy link

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot closed this Mar 27, 2022
Successfully merging this pull request may close these issues.

overload manager: overload signals based on number of downstream connections and active requests