
ROX-24945: adds rate limiting of network events per container #1945

Merged
merged 13 commits into from
Nov 19, 2024

Conversation

Collaborator

@Stringy Stringy commented Nov 7, 2024

Description

This adds new rate limiting of network connections per container, on top of the existing afterglow and batching methods, in order to provide a safety mechanism (an upper limit) on the number of reported network connections for a given workload. When External IPs is enabled, the volume of connections will naturally be much higher, so this limit primarily serves to control that volume.

Some caveats and decisions:

  • The rate limiting is implemented at the last possible moment, as the batch message is being constructed. This means we do not rate limit connections that would otherwise be dropped by afterglow. It also means that we're rate limiting the delta between the old state and the new state.
  • Close events (those with a close timestamp) are not rate limited. Otherwise we could end up in a situation where a connection exists in Sensor's model but its close event is never sent, so the connection would exist forever.
  • The time window for rate limiting is pegged to the scrape interval, to provide a logical and well-understood time frame for scale. This means we can also think of this rate limit as a constraint on the maximum size of batch we'll ever send to Sensor.
  • The rate limiting is enabled regardless of external IPs. I think this makes the overall behaviour of network event reporting much clearer and more predictable, but I am not married to this notion and should we want to retain old behaviour for the vast majority of cases, we can change this.
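The per-container limiting described above can be sketched as a small cache keyed by container ID, with a counter that resets each window (pegged to the scrape interval). This is a simplified illustration, not the PR's actual RateLimitCache (which, per the diff below, also takes a capacity argument); all names here are assumptions:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>
#include <unordered_map>

// Simplified sketch: each container ID gets a budget of `limit` events per
// `window`. The counter resets once the window elapses.
class RateLimitCache {
 public:
  RateLimitCache(int64_t limit, std::chrono::seconds window)
      : limit_(limit), window_(window) {}

  // Returns true if an event for this container is still within budget.
  bool Allow(const std::string& container_id,
             std::chrono::steady_clock::time_point now) {
    auto& entry = entries_[container_id];
    if (now - entry.window_start >= window_) {
      // Window elapsed (or first event): start a fresh window.
      entry.window_start = now;
      entry.count = 0;
    }
    return ++entry.count <= limit_;
  }

 private:
  struct Entry {
    std::chrono::steady_clock::time_point window_start{};
    int64_t count = 0;
  };
  int64_t limit_;
  std::chrono::seconds window_;
  std::unordered_map<std::string, Entry> entries_;
};
```

At batch-construction time, a non-close event would only be appended to the batch if `Allow(container_id, now)` returns true; close events bypass the check entirely, per the caveats above.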

Depends on: stackrox/stackrox#13228

Checklist

  • Investigated and inspected CI test results
  • Updated documentation accordingly

Automated testing

  • Added unit tests
  • Added integration tests
  • Added regression tests

If any of these don't apply, please comment below.

Testing Performed

TODO(replace-me)
Use this space to explain how you tested your PR, or, if you didn't test it, why you did not do so. (Valid reasons include "CI is sufficient" or "No testable changes")
In addition to reviewing your code, reviewers must also review your testing instructions, and make sure they are sufficient.

For more details, ref the Confluence page about this section.

@Stringy Stringy requested a review from a team as a code owner November 7, 2024 09:20
@@ -107,6 +107,16 @@ class CollectorConfig {
return enable_external_ips_;
}

int64_t PerContainerRateLimit() const {
std::shared_lock<std::shared_mutex> lock(mutex_);
Collaborator

[nitpick] There's a helper method that should make it a bit clear this is a read lock.

Suggested change
std::shared_lock<std::shared_mutex> lock(mutex_);
auto lock = ReadLock();

Collaborator Author

Done

Comment on lines +112 to +117
if (runtime_config_.has_value()) {
return runtime_config_.value()
.networking()
.per_container_rate_limit();
}
return per_container_rate_limit_;
Collaborator

This handling is getting out of hand. We need to make it so that all of this configuration is always stored in the runtime_config_ object and, if we are not using a runtime config, just overwrite the values in there ourselves.
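The suggested direction (always populating the runtime config so getters have a single code path) might look like this, sketched with plain structs standing in for the protobuf types; all names and defaults here are illustrative:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Plain structs stand in for the protobuf runtime-config messages.
struct NetworkingConfig {
  int64_t per_container_rate_limit = 0;
};
struct RuntimeConfig {
  NetworkingConfig networking;
};

class Config {
 public:
  // Called once at startup: seed the runtime config with defaults so it is
  // always populated, letting a later runtime update simply overwrite it.
  void Init(int64_t default_limit) {
    RuntimeConfig cfg;
    cfg.networking.per_container_rate_limit = default_limit;
    runtime_config_ = cfg;
  }

  void Update(const RuntimeConfig& cfg) { runtime_config_ = cfg; }

  int64_t PerContainerRateLimit() const {
    // Single code path: no has_value() branch, since Init() guarantees
    // runtime_config_ is populated.
    return runtime_config_->networking.per_container_rate_limit;
  }

 private:
  std::optional<RuntimeConfig> runtime_config_;
};
```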

@@ -1,10 +1,13 @@
#include "NetworkSignalHandler.h"

#include <chrono>
Collaborator

Why do we need chrono now?

Collaborator Author

We don't - removed

@@ -147,6 +150,7 @@ SignalHandler::Result NetworkSignalHandler::HandleSignal(sinsp_evt* evt) {
return SignalHandler::IGNORED;
}

CLOG(DEBUG) << "Live event: " << result.value();
Collaborator

Last time I tried adding something like this it completely destroyed performance in debug mode, have you tried it?

Collaborator Author

Yeah this was just for manual verification - removed now.

@@ -1,10 +1,13 @@
#include "NetworkStatusNotifier.h"

#include <sstream>
Collaborator

Are we sure we need sstream? I would assume Logging.h is enough.

Collaborator Author

Removed

@@ -31,7 +32,8 @@ class NetworkStatusNotifier : protected ProtoAllocator<sensor::NetworkConnection
config_(config),
comm_(comm),
connections_total_reporter_(connections_total_reporter),
connections_rate_reporter_(connections_rate_reporter) {
connections_rate_reporter_(connections_rate_reporter),
connections_rate_limiter_(RateLimitCache(config.PerContainerRateLimit(), config.PerContainerRateLimit(), config.ScrapeInterval())) {
Collaborator

Why are we constructing a RateLimitCache and then copy constructing another one? Couldn't we just...

Suggested change
connections_rate_limiter_(RateLimitCache(config.PerContainerRateLimit(), config.PerContainerRateLimit(), config.ScrapeInterval())) {
connections_rate_limiter_(config.PerContainerRateLimit(), config.PerContainerRateLimit(), config.ScrapeInterval()) {

Collaborator Author

This was due to my ignorance about how these constructors worked - my bad! I've removed the connections_rate_limiter_ member now anyway so this problem has also been removed.

s.StartContainerStats()
image_store := config.Images()

berserker_img := image_store.QaImageByKey("performance-berserker")
Collaborator

🦀

Comment on lines 22 to 31
berserker_img := image_store.QaImageByKey("performance-berserker")

images := []string{
berserker_img,
}

for _, image := range images {
err := s.executor.PullImage(image)
s.Require().NoError(err)
}
Collaborator

I mean...

Suggested change
berserker_img := image_store.QaImageByKey("performance-berserker")
images := []string{
berserker_img,
}
for _, image := range images {
err := s.executor.PullImage(image)
s.Require().NoError(err)
}
berserker_img := image_store.QaImageByKey("performance-berserker")
err := s.Executor().PullImage(berserker_img)
s.Require().NoError(err)

Unless you are actively planning to add more images to the test.

Collaborator Author

I've (temporarily) removed this test suite, in favour of unit tests. I'll work on it on the side and add it back in another PR.

Comment on lines 45 to 57
berserkerID, err := s.Executor().StartContainer(config.ContainerStartConfig{
Name: "berserker-server",
Image: berserker_img,
Env: map[string]string{
"BERSERKER__WORKLOAD__NCONNECTIONS": "100",
"RUST_BACKTRACE": "full",
},
Command: []string{"/etc/berserker/network/server.toml"},
NetworkMode: "host",
Privileged: true,
})
s.Require().NoError(err)
s.serverID = berserkerID
Collaborator

[nitpick]

Suggested change
berserkerID, err := s.Executor().StartContainer(config.ContainerStartConfig{
Name: "berserker-server",
Image: berserker_img,
Env: map[string]string{
"BERSERKER__WORKLOAD__NCONNECTIONS": "100",
"RUST_BACKTRACE": "full",
},
Command: []string{"/etc/berserker/network/server.toml"},
NetworkMode: "host",
Privileged: true,
})
s.Require().NoError(err)
s.serverID = berserkerID
s.serverID, err = s.Executor().StartContainer(config.ContainerStartConfig{
Name: "berserker-server",
Image: berserker_img,
Env: map[string]string{
"BERSERKER__WORKLOAD__NCONNECTIONS": "100",
"RUST_BACKTRACE": "full",
},
Command: []string{"/etc/berserker/network/server.toml"},
NetworkMode: "host",
Privileged: true,
})
s.Require().NoError(err)

Comment on lines 59 to 72
berserkerID, err = s.Executor().StartContainer(config.ContainerStartConfig{
Name: "berserker-client",
Image: berserker_img,
Env: map[string]string{
"BERSERKER__WORKLOAD__NCONNECTIONS": "100",
"RUST_BACKTRACE": "full",
},
Command: []string{"/etc/berserker/network/client.toml"},
NetworkMode: "host",
Privileged: true,
})
s.Require().NoError(err)
s.clientID = berserkerID
}
Collaborator

[nitpick]

Suggested change
berserkerID, err = s.Executor().StartContainer(config.ContainerStartConfig{
Name: "berserker-client",
Image: berserker_img,
Env: map[string]string{
"BERSERKER__WORKLOAD__NCONNECTIONS": "100",
"RUST_BACKTRACE": "full",
},
Command: []string{"/etc/berserker/network/client.toml"},
NetworkMode: "host",
Privileged: true,
})
s.Require().NoError(err)
s.clientID = berserkerID
}
s.clientID, err = s.Executor().StartContainer(config.ContainerStartConfig{
Name: "berserker-client",
Image: berserker_img,
Env: map[string]string{
"BERSERKER__WORKLOAD__NCONNECTIONS": "100",
"RUST_BACKTRACE": "full",
},
Command: []string{"/etc/berserker/network/client.toml"},
NetworkMode: "host",
Privileged: true,
})
s.Require().NoError(err)
}

@Stringy Stringy force-pushed the giles/ROX-24945-connections-release-valve branch from 8a18ea2 to 0493951 Compare November 8, 2024 10:36
@Stringy Stringy requested a review from Molter73 November 8, 2024 10:40
@Stringy Stringy requested a review from ovalenti November 8, 2024 12:49
Collaborator

@Molter73 Molter73 left a comment

LGTM! Left a few comments but nothing major.

@@ -5,6 +5,7 @@
#include <libsinsp/sinsp.h>

#include "EventMap.h"
#include "Logging.h"
Collaborator

Think we can now remove this include as well.

@@ -73,6 +78,7 @@ class NetworkStatusNotifier : protected ProtoAllocator<sensor::NetworkConnection
std::shared_ptr<CollectorConnectionStats<float>> connections_rate_reporter_;
std::chrono::steady_clock::time_point connections_last_report_time_; // time delta between the current reporting and the previous (rate computation)
std::optional<ConnectionTracker::Stats> connections_rate_counter_last_; // previous counter values (rate computation)
//
Collaborator

Suggested change
//

Comment on lines +365 to +368
Connection conn1("containerId", Endpoint(Address(10, 0, 1, 32), 1024), Endpoint(Address(192, 168, 0, 1), 1000), L4Proto::TCP, true);
Connection conn2("containerId", Endpoint(Address(10, 0, 1, 32), 1024), Endpoint(Address(192, 168, 0, 2), 1001), L4Proto::TCP, true);
Connection conn3("containerId", Endpoint(Address(10, 0, 1, 32), 1024), Endpoint(Address(192, 168, 0, 3), 1002), L4Proto::TCP, true);
Connection conn4("containerId", Endpoint(Address(10, 0, 1, 32), 1024), Endpoint(Address(192, 168, 0, 4), 1003), L4Proto::TCP, true);
Collaborator

Could we group these in a vector or array?

Collaborator Author

I've tried this out, but I think I prefer the clarity of statically defining the connections and the deltas - manipulating vectors hides a bit of the intention behind each one. To make it all a bit clearer, though, I've statically defined active and closed statuses so there's no more duplication in the deltas.

Comment on lines +370 to +371
Connection otherConn1("containerIdOther", Endpoint(Address(10, 10, 1, 32), 1024), Endpoint(Address(192, 168, 0, 1), 1000), L4Proto::TCP, true);
Connection otherConn2("containerIdOther", Endpoint(Address(10, 10, 1, 32), 1024), Endpoint(Address(192, 168, 0, 2), 1001), L4Proto::TCP, true);
Collaborator

Same for these two to be grouped in a vector or array.

Comment on lines 374 to 379
ConnMap deltaSingle = {
{conn1, ConnStatus(1234, true)},
{conn2, ConnStatus(1234, true)},
{conn3, ConnStatus(1234, true)},
{conn4, ConnStatus(1234, true)},
};
Collaborator

@Molter73 Molter73 Nov 8, 2024

With the connections in a container, this can be generated in a for loop:

Suggested change
ConnMap deltaSingle = {
{conn1, ConnStatus(1234, true)},
{conn2, ConnStatus(1234, true)},
{conn3, ConnStatus(1234, true)},
{conn4, ConnStatus(1234, true)},
};
ConnMap deltaSingle{};
for (const auto& conn : connections) {
  deltaSingle.emplace(conn, ConnStatus(1234, true));
}

Collaborator

There's probably something similar you can do for deltaDuo and deltaClose.

Comment on lines 400 to 403
google::protobuf::RepeatedPtrField<sensor::NetworkConnection>
updatesSingle,
updatesDuo,
updatesClose;
Collaborator

[nitpick] I had a bit of a problem understanding that these are 3 separate variables of type google::protobuf::RepeatedPtrField<sensor::NetworkConnection> (probably a skill issue, so feel free to ignore this), but you could define a shorter name for the type and declare all three variables separately:

Suggested change
google::protobuf::RepeatedPtrField<sensor::NetworkConnection>
updatesSingle,
updatesDuo,
updatesClose;
using UpdatedConnections = google::protobuf::RepeatedPtrField<sensor::NetworkConnection>;
UpdatedConnections updatesSingle;
UpdatedConnections updatesDuo;
UpdatedConnections updatesClose;

Collaborator

@Molter73 Molter73 Nov 8, 2024

Or you could use an array

Suggested change
google::protobuf::RepeatedPtrField<sensor::NetworkConnection>
updatesSingle,
updatesDuo,
updatesClose;
using UpdatedConnections = google::protobuf::RepeatedPtrField<sensor::NetworkConnection>;
std::array<UpdatedConnections, 3> updatedConnections;

Contributor

@ovalenti ovalenti left a comment

Thanks for the stats counter. Overall, this looks good!

I tend to agree with @Molter73 on the fact that it would be easier to make sure that the runtime config always contains the value to use (even if it requires us to write the default values there).

collector/test/NetworkStatusNotifierTest.cpp (comment outdated; resolved)
@Stringy
Collaborator Author

Stringy commented Nov 18, 2024

it would be easier to make sure that the runtime config always contains the value to use (even if it requires us to write the default values there).

Yes, I agree, though I don't think we can define the defaults in the proto itself (due to limitations with protobufs) so I think we'll need to enforce them somewhere. It might make sense to encapsulate that in Collector, but outside of the existing CollectorConfig class. I'll look into it in a follow up PR.

Contributor

@JoukoVirtanen JoukoVirtanen left a comment

LGTM!

@Stringy Stringy merged commit db929ab into master Nov 19, 2024
50 of 53 checks passed
@Stringy Stringy deleted the giles/ROX-24945-connections-release-valve branch November 19, 2024 09:26