
[Overload] Active downstream connections resource monitor #19186

Closed

Conversation

nezdolik
Member

@nezdolik nezdolik commented Dec 3, 2021

Signed-off-by: Kateryna Nezdolii [email protected]

Active downstream connections resource monitor based on the new proactive checks in the overload manager framework. Continuation of work started in this PR. Once this PR is merged, we can replace the existing global and per-listener connection tracking mechanism in the TCP listener and plug in the overload manager's downstream connections resource monitor instead.

Commit Message:
Additional Description:
Risk Level: Low (new extension not wired up with existing code)
Testing: Done
Docs Changes: TBD
Release Notes:
Platform Specific Features: NA
Fixes #12419

@repokitteh-read-only

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #19186 was opened by nezdolik.


@nezdolik
Member Author

nezdolik commented Dec 3, 2021

Working on tests

@nezdolik nezdolik marked this pull request as draft December 3, 2021 15:47
Kateryna Nezdolii added 2 commits December 7, 2021 23:10
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
@nezdolik nezdolik marked this pull request as ready for review December 7, 2021 23:20
@nezdolik
Member Author

nezdolik commented Dec 7, 2021

My suggestion is not to add public docs for this monitor in this PR (until the monitor is wired up with the connection tracking code within TcpListener in the next PR).

Kateryna Nezdolii added 2 commits December 7, 2021 23:28
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
@nezdolik
Member Author

nezdolik commented Dec 8, 2021

@markdroth
Contributor

/lgtm api

@repokitteh-read-only repokitteh-read-only bot removed the api label Dec 9, 2021
Contributor

@KBaichoo KBaichoo left a comment


Thanks for working on this! Here's a first pass

@@ -0,0 +1,22 @@
syntax = "proto3";

package envoy.extensions.resource_monitors.downstream_connections.v3;
Member Author

Looks like it is not possible to specify any version other than v3 for an extension, according to the API style guide for extensions. I've marked the entire proto file for this extension as work in progress and hid it from the docs (until the third PR lands that makes use of this new monitor in the TCP listener).

int64_t maxResourceUsage() const override;

protected:
uint64_t max_;
Contributor

Thoughts on perhaps changing this, and where it's set in the proto, to int64_t instead? The pgv rule can enforce that this value is > 0 (if that was the concern). Currently, there are values this could be set to where the static cast is unsafe.
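
A small standalone illustration of the hazard (my example, not code from the PR); on the usual two's-complement platforms a uint64_t above int64_t's maximum wraps to a negative value when cast:

#include <cstdint>
#include <limits>

// uint64_t::max() has all 64 bits set; reinterpreted as a signed 64-bit
// value it becomes -1, so the "maximum" silently turns into an invalid limit.
uint64_t configured = std::numeric_limits<uint64_t>::max();
int64_t max_connections = static_cast<int64_t>(configured); // == -1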

envoy::extensions::resource_monitors::downstream_connections::v3::DownstreamConnectionsConfig
config;
std::unique_ptr<ActiveDownstreamConnectionsResourceMonitor> monitor(
new ActiveDownstreamConnectionsResourceMonitor(config));
Contributor

prefer make_unique (here and elsewhere)
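
A minimal sketch of the suggested change, using the types from the quoted snippet:

// std::make_unique avoids the explicit `new` and keeps the allocation
// exception-safe.
auto monitor =
    std::make_unique<ActiveDownstreamConnectionsResourceMonitor>(config);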

DownstreamConnectionsConfig& config)
: max_(config.max_active_downstream_connections()), current_(0){};

bool ActiveDownstreamConnectionsResourceMonitor::tryAllocateResource(int64_t increment) {
Contributor

In real use cases can tryAllocateResources and tryDeallocateResources be called concurrently?

Member Author

Yes, they can be called, for example, from multiple worker threads.

Contributor

I do think using RAII for the resource allocation/decrement is probably the way to go, to stop incorrect usage from being possible and leaking resources.

For example, if max = 15 and originally we have current = 10, with some bad calls and threads swapping I think we can end up with strange results:

  1. A call to tryAlloc(10) that stops at line 20 (doesn't yet read current for the decrement)
  2. A call to deAlloc(20) => current is still 20, so we decrement and store 0 into current
  3. The first call reads current (now 0) and decrements 10, giving us current of -10 :(

Member Author

@nezdolik nezdolik Feb 18, 2022

Realised that the RAII/Memento approach will not work with cross-thread visibility. Various threads will be accessing the tryAlloc/deAlloc resource via their thread-local overload state object. The point of using an atomic counter was to bypass the periodic slow flushes/updates in OM (where the thread-local overload state is periodically updated for all threads) and instead perform faster checks from any thread, relying on atomic guarantees for cross-thread visibility of the latest counter value.
@KBaichoo wdyt?
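
For context, a minimal sketch of the atomic compare-exchange pattern the author describes; the class and method names here are illustrative, not the PR's actual code:

#include <atomic>
#include <cstdint>

class ActiveConnectionsCounter {
public:
  explicit ActiveConnectionsCounter(int64_t max) : max_(max), current_(0) {}

  // Lock-free check-and-increment: every worker thread sees the latest
  // counter value through the atomic, with no waiting for the overload
  // manager's periodic thread-local state flushes.
  bool tryAllocate(int64_t increment) {
    int64_t current = current_.load();
    do {
      if (current + increment > max_) {
        return false; // Over the limit; leave the counter untouched.
      }
      // On CAS failure `current` is reloaded, so the bound is re-checked
      // against the newest value before retrying.
    } while (!current_.compare_exchange_weak(current, current + increment));
    return true;
  }

  bool tryDeallocate(int64_t decrement) {
    int64_t current = current_.load();
    do {
      if (current - decrement < 0) {
        return false; // Would go negative, as in the race described earlier.
      }
    } while (!current_.compare_exchange_weak(current, current - decrement));
    return true;
  }

private:
  const int64_t max_;
  std::atomic<int64_t> current_;
};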

@phlax
Member

phlax commented Dec 15, 2021

/wait

@KBaichoo
Contributor

/assign @KBaichoo

Signed-off-by: Kateryna Nezdolii <[email protected]>
Kateryna Nezdolii added 3 commits December 29, 2021 09:25
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
@nezdolik
Member Author

need to increase coverage

@jmarantz
Contributor

jmarantz commented Jan 4, 2022

/wait

Signed-off-by: Kateryna Nezdolii <[email protected]>
@nezdolik
Member Author

Code coverage for source/extensions/resource_monitors/downstream_connections is lower than the limit of 96.6 (88.9).
Partially fixed coverage; still need to increase it to at least 96.6.

@KBaichoo
Contributor

The lines missing for coverage:

https://storage.googleapis.com/envoy-pr/6101a36/coverage/source/extensions/resource_monitors/downstream_connections/downstream_connections_monitor.cc.gcov.html

Seems like they might be hard to reliably cover in a test, since you'd need two concurrent calls to tryDeallocateResource(int64_t decrement) that would pass the release assert, but one of which would fail to decrement.

An alternative that would avoid that issue is using an opaque object, sort of like a memento: https://en.wikipedia.org/wiki/Memento_pattern

e.g.

class ResourceStore {
public:
  ResourceStore() : capacity_(0) {}

private:
  // Allows the monitor to query this value, but keeps it opaque to others.
  friend class ActiveDownstreamConnectionsResourceMonitor;
  int64_t capacity_;
};

Then have your factory include a call to allocate this object, and then take it with increment and decrement to adjust that capacity. Because the capacity belongs to a given resource store, we shouldn't have the issue of needing to revert a decrement because it goes negative. A hypothetical usage sketch follows.
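
A hypothetical usage sketch of this suggestion; allocateStore, increment, and decrement are illustrative names, not APIs from the PR:

// Each caller owns its ResourceStore, so a deallocation can only release
// capacity that this caller actually acquired; a stray call can never push
// another caller's count negative.
auto store = monitor.allocateStore(); // factory hands out an opaque store
monitor.increment(*store, 1);         // connection accepted
monitor.decrement(*store, 1);         // connection closed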

auto factory =
Registry::FactoryRegistry<Server::Configuration::ProactiveResourceMonitorFactory>::getFactory(
"envoy.resource_monitors.downstream_connections");
EXPECT_NE(factory, nullptr);
Contributor

This seems like it should be an ASSERT, as we shouldn't be continuing the test if this fails. Here and below.
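
A one-line illustration, using standard gtest semantics:

// ASSERT_NE aborts the current test function on failure, so the test never
// goes on to dereference a null factory; EXPECT_NE records the failure and
// keeps running, typically crashing on the next use.
ASSERT_NE(factory, nullptr);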


envoy::extensions::resource_monitors::downstream_connections::v3::DownstreamConnectionsConfig
config;
config.set_max_active_downstream_connections(std::numeric_limits<uint64_t>::max());
Contributor

Perhaps using -1 (or some other more direct invalid value) could be clearer than relying on uint64_t::max() bits being interpreted as an int64_t value of -1.

Member Author

agree

// [#not-implemented-hide:]
message DownstreamConnectionsConfig {
// Maximum threshold for global open active downstream connections, defaults to 0.
// If monitor is configured via Overload manager api and has no value set, Envoy will reject all incoming connections.
Contributor

Technically, with the gt annotation that you added, Envoy will reject the configuration, as your test https://github.com/envoyproxy/envoy/pull/19186/files#diff-1fdfe762c6263d74ffc5365cd93a44ff14c64adac898bbf1e2de53a3a91e000bR61 shows.

Is this expected to be >= 0 or > 0?

"If monitor is configured via Overload manager api and has no value set, Envoy will reject all incoming connections."

Member Author

I think this is up for discussion. Having it '> 0' would require users to explicitly configure a threshold for the monitor (and fail with an error on startup otherwise, if the threshold is not configured). '>= 0' can be more tricky for users: if they forget to configure the threshold, the monitor will use the default value of 0 and reject all incoming connections. The first option (require an explicit value > 0) is cleaner in my opinion, although the monitor will then not be able to reject all incoming connections.

Contributor

sgtm

@KBaichoo
Contributor

/wait

@nezdolik
Member Author

Keepalive comment: am currently on parental leave with limited capacity, but will try to get this finished.

Signed-off-by: Kateryna Nezdolii <[email protected]>
@nezdolik
Member Author

nezdolik commented Feb 9, 2022

Please bear with my slowness. Applied review comments/nits; will take care of missing coverage for the concurrent code block later this week.

Contributor

@KBaichoo KBaichoo left a comment


Thanks for making forward progress on this even while you're on leave. Also, congratulations 🎉.

This is nearly there.


@KBaichoo
Contributor

/wait

@github-actions

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Mar 20, 2022
@github-actions
Copy link

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot closed this Mar 27, 2022
Successfully merging this pull request may close these issues.

overload manager: overload signals based on number of downstream connections and active requests