-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ROX-24945: adds rate limiting of network events per container #1945
Conversation
collector/lib/CollectorConfig.h
Outdated
@@ -107,6 +107,16 @@ class CollectorConfig { | |||
return enable_external_ips_; | |||
} | |||
|
|||
int64_t PerContainerRateLimit() const { | |||
std::shared_lock<std::shared_mutex> lock(mutex_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] There's a helper method that should make it a bit clear this is a read lock.
std::shared_lock<std::shared_mutex> lock(mutex_); | |
auto lock = ReadLock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
if (runtime_config_.has_value()) { | ||
return runtime_config_.value() | ||
.networking() | ||
.per_container_rate_limit(); | ||
} | ||
return per_container_rate_limit_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This handling is getting out of hand, we need to make it so all this configuration is always stored in the runtime_config_
object and, if we are not using it, just overwrite the value in there ourselves.
@@ -1,10 +1,13 @@ | |||
#include "NetworkSignalHandler.h" | |||
|
|||
#include <chrono> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need chrono now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't - removed
@@ -147,6 +150,7 @@ SignalHandler::Result NetworkSignalHandler::HandleSignal(sinsp_evt* evt) { | |||
return SignalHandler::IGNORED; | |||
} | |||
|
|||
CLOG(DEBUG) << "Live event: " << result.value(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last time I tried adding something like this it completely destroyed performance in debug mode, have you tried it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this was just for manual verification - removed now.
@@ -1,10 +1,13 @@ | |||
#include "NetworkStatusNotifier.h" | |||
|
|||
#include <sstream> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure we need sstream? I would assume Logging.h is enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
@@ -31,7 +32,8 @@ class NetworkStatusNotifier : protected ProtoAllocator<sensor::NetworkConnection | |||
config_(config), | |||
comm_(comm), | |||
connections_total_reporter_(connections_total_reporter), | |||
connections_rate_reporter_(connections_rate_reporter) { | |||
connections_rate_reporter_(connections_rate_reporter), | |||
connections_rate_limiter_(RateLimitCache(config.PerContainerRateLimit(), config.PerContainerRateLimit(), config.ScrapeInterval())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we constructing a RateLimitCache
and then copy constructing another one? Couldn't we just...
connections_rate_limiter_(RateLimitCache(config.PerContainerRateLimit(), config.PerContainerRateLimit(), config.ScrapeInterval())) { | |
connections_rate_limiter_(config.PerContainerRateLimit(), config.PerContainerRateLimit(), config.ScrapeInterval()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was due to my ignorance about how these constructors worked - my bad! I've removed the connections_rate_limiter_ member now anyway so this problem has also been removed.
s.StartContainerStats() | ||
image_store := config.Images() | ||
|
||
berserker_img := image_store.QaImageByKey("performance-berserker") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🦀
berserker_img := image_store.QaImageByKey("performance-berserker") | ||
|
||
images := []string{ | ||
berserker_img, | ||
} | ||
|
||
for _, image := range images { | ||
err := s.executor.PullImage(image) | ||
s.Require().NoError(err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean...
berserker_img := image_store.QaImageByKey("performance-berserker") | |
images := []string{ | |
berserker_img, | |
} | |
for _, image := range images { | |
err := s.executor.PullImage(image) | |
s.Require().NoError(err) | |
} | |
berserker_img := image_store.QaImageByKey("performance-berserker") | |
err := s.Executor().PullImage(berserker_img) | |
s.Require().NoError(err) |
Unless you are actively planning to add more images to the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've (temporarily) removed this test suite now, in favour of unit tests. I'll work on it on the side and add it back in in another PR.
berserkerID, err := s.Executor().StartContainer(config.ContainerStartConfig{ | ||
Name: "berserker-server", | ||
Image: berserker_img, | ||
Env: map[string]string{ | ||
"BERSERKER__WORKLOAD__NCONNECTIONS": "100", | ||
"RUST_BACKTRACE": "full", | ||
}, | ||
Command: []string{"/etc/berserker/network/server.toml"}, | ||
NetworkMode: "host", | ||
Privileged: true, | ||
}) | ||
s.Require().NoError(err) | ||
s.serverID = berserkerID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick]
berserkerID, err := s.Executor().StartContainer(config.ContainerStartConfig{ | |
Name: "berserker-server", | |
Image: berserker_img, | |
Env: map[string]string{ | |
"BERSERKER__WORKLOAD__NCONNECTIONS": "100", | |
"RUST_BACKTRACE": "full", | |
}, | |
Command: []string{"/etc/berserker/network/server.toml"}, | |
NetworkMode: "host", | |
Privileged: true, | |
}) | |
s.Require().NoError(err) | |
s.serverID = berserkerID | |
s.serverID, err = s.Executor().StartContainer(config.ContainerStartConfig{ | |
Name: "berserker-server", | |
Image: berserker_img, | |
Env: map[string]string{ | |
"BERSERKER__WORKLOAD__NCONNECTIONS": "100", | |
"RUST_BACKTRACE": "full", | |
}, | |
Command: []string{"/etc/berserker/network/server.toml"}, | |
NetworkMode: "host", | |
Privileged: true, | |
}) | |
s.Require().NoError(err) |
berserkerID, err = s.Executor().StartContainer(config.ContainerStartConfig{ | ||
Name: "berserker-client", | ||
Image: berserker_img, | ||
Env: map[string]string{ | ||
"BERSERKER__WORKLOAD__NCONNECTIONS": "100", | ||
"RUST_BACKTRACE": "full", | ||
}, | ||
Command: []string{"/etc/berserker/network/client.toml"}, | ||
NetworkMode: "host", | ||
Privileged: true, | ||
}) | ||
s.Require().NoError(err) | ||
s.clientID = berserkerID | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick]
berserkerID, err = s.Executor().StartContainer(config.ContainerStartConfig{ | |
Name: "berserker-client", | |
Image: berserker_img, | |
Env: map[string]string{ | |
"BERSERKER__WORKLOAD__NCONNECTIONS": "100", | |
"RUST_BACKTRACE": "full", | |
}, | |
Command: []string{"/etc/berserker/network/client.toml"}, | |
NetworkMode: "host", | |
Privileged: true, | |
}) | |
s.Require().NoError(err) | |
s.clientID = berserkerID | |
} | |
s.clientID, err = s.Executor().StartContainer(config.ContainerStartConfig{ | |
Name: "berserker-client", | |
Image: berserker_img, | |
Env: map[string]string{ | |
"BERSERKER__WORKLOAD__NCONNECTIONS": "100", | |
"RUST_BACKTRACE": "full", | |
}, | |
Command: []string{"/etc/berserker/network/client.toml"}, | |
NetworkMode: "host", | |
Privileged: true, | |
}) | |
s.Require().NoError(err) | |
} |
8a18ea2
to
0493951
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Left a few comments but nothing major.
@@ -5,6 +5,7 @@ | |||
#include <libsinsp/sinsp.h> | |||
|
|||
#include "EventMap.h" | |||
#include "Logging.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think we can now remove this include as well.
@@ -73,6 +78,7 @@ class NetworkStatusNotifier : protected ProtoAllocator<sensor::NetworkConnection | |||
std::shared_ptr<CollectorConnectionStats<float>> connections_rate_reporter_; | |||
std::chrono::steady_clock::time_point connections_last_report_time_; // time delta between the current reporting and the previous (rate computation) | |||
std::optional<ConnectionTracker::Stats> connections_rate_counter_last_; // previous counter values (rate computation) | |||
// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// |
Connection conn1("containerId", Endpoint(Address(10, 0, 1, 32), 1024), Endpoint(Address(192, 168, 0, 1), 1000), L4Proto::TCP, true); | ||
Connection conn2("containerId", Endpoint(Address(10, 0, 1, 32), 1024), Endpoint(Address(192, 168, 0, 2), 1001), L4Proto::TCP, true); | ||
Connection conn3("containerId", Endpoint(Address(10, 0, 1, 32), 1024), Endpoint(Address(192, 168, 0, 3), 1002), L4Proto::TCP, true); | ||
Connection conn4("containerId", Endpoint(Address(10, 0, 1, 32), 1024), Endpoint(Address(192, 168, 0, 4), 1003), L4Proto::TCP, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we group these in a vector or array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried this out but I think I prefer the clarity of statically defining the connections and the deltas - manipulating vectors hides a bit of the intention behind each one. To make it all a bit clearer, though, I've statically define active and closed statuses so there's no more duplication in the deltas
Connection otherConn1("containerIdOther", Endpoint(Address(10, 10, 1, 32), 1024), Endpoint(Address(192, 168, 0, 1), 1000), L4Proto::TCP, true); | ||
Connection otherConn2("containerIdOther", Endpoint(Address(10, 10, 1, 32), 1024), Endpoint(Address(192, 168, 0, 2), 1001), L4Proto::TCP, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for these two to be grouped in a vector or array.
ConnMap deltaSingle = { | ||
{conn1, ConnStatus(1234, true)}, | ||
{conn2, ConnStatus(1234, true)}, | ||
{conn3, ConnStatus(1234, true)}, | ||
{conn4, ConnStatus(1234, true)}, | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the connections in a container, this can be generated in a for loop:
ConnMap deltaSingle = { | |
{conn1, ConnStatus(1234, true)}, | |
{conn2, ConnStatus(1234, true)}, | |
{conn3, ConnStatus(1234, true)}, | |
{conn4, ConnStatus(1234, true)}, | |
}; | |
ConnMap deltaSingle{}; | |
for {const auto& conn : connections) { | |
deltaSingle.emplace(conn, ConnStatus(1234, true)); | |
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's probably something similar you can do to deltaDuo and deltaClose.
google::protobuf::RepeatedPtrField<sensor::NetworkConnection> | ||
updatesSingle, | ||
updatesDuo, | ||
updatesClose; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] I had a bit of problem understanding these are 3 separate variables of type google::protobuf::RepeatedPtrField<sensor::NetworkConnection>
, probably a skill issue, so feel free to ignore this, but you could define a shorter name for the type and declare all three variables separately:
google::protobuf::RepeatedPtrField<sensor::NetworkConnection> | |
updatesSingle, | |
updatesDuo, | |
updatesClose; | |
using UpdatedConnections = google::protobuf::RepeatedPtrField<sensor::NetworkConnection>; | |
UpdatedConnections updatesSingle; | |
UpdatedConnections updatesDuo; | |
UpdatedConnections updatesClose; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or you could use an array
google::protobuf::RepeatedPtrField<sensor::NetworkConnection> | |
updatesSingle, | |
updatesDuo, | |
updatesClose; | |
using UpdatedConnections = google::protobuf::RepeatedPtrField<sensor::NetworkConnection>; | |
std::array<UpdatedConnections, 3> updatedConnections; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the stats counter. Overall, this looks good !
I tend to agree with @Molter73 on the fact that it would be easier to make sure that the runtime config always contains the value to use (even if it requires us to write the default values there).
Yes, I agree, though I don't think we can define the defaults in the proto itself (due to limitations with protobufs) so I think we'll need to enforce them somewhere. It might make sense to encapsulate that in Collector, but outside of the existing CollectorConfig class. I'll look into it in a follow up PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Description
This adds new rate limiting of network connections per container, on top of the existing afterglow and batching methods in order to provide a safety mechanism (upper limit) to the number of reported network connections for a given workload. When External IPs is enabled, the volume of connections will naturally be much higher, so this limit is primarily in service to controlling that volume.
Some caveats and decisions:
Depends on: stackrox/stackrox#13228
Checklist
Automated testing
If any of these don't apply, please comment below.
Testing Performed
TODO(replace-me)
Use this space to explain how you tested your PR, or, if you didn't test it, why you did not do so. (Valid reasons include "CI is sufficient" or "No testable changes")
In addition to reviewing your code, reviewers must also review your testing instructions, and make sure they are sufficient.
For more details, ref the Confluence page about this section.