[CONTP-549] throttle the number of parallel client initial syncs in the tagger server #31741
Conversation
Regression Detector
Regression Detector Results (Metrics dashboard)
Baseline: 97e21b6
Optimization Goals: ✅ No significant changes detected
perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
---|---|---|---|---|---|---|
➖ | quality_gate_idle | memory utilization | +0.54 | [+0.50, +0.57] | 1 | Logs bounds checks dashboard |
➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | +0.42 | [-0.26, +1.11] | 1 | Logs |
➖ | file_to_blackhole_100ms_latency | egress throughput | +0.11 | [-0.59, +0.81] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency_http2 | egress throughput | +0.03 | [-0.86, +0.91] | 1 | Logs |
➖ | quality_gate_idle_all_features | memory utilization | +0.01 | [-0.07, +0.10] | 1 | Logs bounds checks dashboard |
➖ | uds_dogstatsd_to_api | ingress throughput | +0.01 | [-0.10, +0.11] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency | egress throughput | +0.00 | [-0.89, +0.90] | 1 | Logs |
➖ | file_to_blackhole_500ms_latency | egress throughput | +0.00 | [-0.77, +0.77] | 1 | Logs |
➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.01, +0.01] | 1 | Logs |
➖ | file_tree | memory utilization | -0.03 | [-0.16, +0.10] | 1 | Logs |
➖ | file_to_blackhole_300ms_latency | egress throughput | -0.07 | [-0.72, +0.58] | 1 | Logs |
➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.14 | [-0.93, +0.65] | 1 | Logs |
➖ | file_to_blackhole_1000ms_latency_linear_load | egress throughput | -0.17 | [-0.63, +0.29] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency_http1 | egress throughput | -0.18 | [-1.07, +0.70] | 1 | Logs |
➖ | tcp_syslog_to_blackhole | ingress throughput | -0.83 | [-0.90, -0.76] | 1 | Logs |
➖ | quality_gate_logs | % cpu utilization | -0.90 | [-4.11, +2.31] | 1 | Logs |
Bounds Checks: ✅ Passed
perf | experiment | bounds_check_name | replicates_passed | links |
---|---|---|---|---|
✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency_linear_load | memory_usage | 10/10 | |
✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_300ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_300ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
✅ | quality_gate_logs | lost_bytes | 10/10 | |
✅ | quality_gate_logs | memory_usage | 10/10 |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
Force-pushed from fbcb3b7 to a2e257b (compare)
Force-pushed from a2e257b to 63e9838 (compare)
Force-pushed from 63e9838 to b390f94 (compare)
pkg/config/setup/config.go (Outdated)
@@ -729,6 +729,9 @@ func InitConfig(config pkgconfigmodel.Setup) {
	config.BindEnvAndSetDefault("clc_runner_server_readheader_timeout", 10)
	config.BindEnvAndSetDefault("clc_runner_remote_tagger_enabled", false)

	// Remote tagger
	config.BindEnvAndSetDefault("remote_tagger.max_concurrent_sync", 4)
Open questions for discussion:
Do you think 4 is a good default value for this option?
In my opinion, a default value of 4 is quite high. Most of the setups I've seen typically use around 3 cluster-check-runner instances.
If we aim to minimize the impact of the initialization stream on the cluster-agent, a default value of 1 would be more appropriate. In most cases, this should suffice because cluster-check-runners are usually rolled out and updated incrementally, one at a time, based on the rollout strategy.
That said, a default value of 1 might be too low for some setups. I'd suggest setting the default to 2 as a balanced choice, though I acknowledge this decision is largely empirical and not based on concrete data.
Test changes on VM
Use this command from test-infra-definitions to manually test this PR's changes on a VM: inv aws.create-vm --pipeline-id=52097096 --os-family=ubuntu
Note: This applies to commit c0967b0
Uncompressed package size comparison
Comparison with ancestor, diff per package
Decision
Looks good. I'm suggesting small nits to ease future code investigation.
timeoutRefreshError := make(chan error)

go func() {
I'm wondering if we can create a function to isolate this logic. Also, should it be the client that sends the keepAlive, and not the server?
Discussed offline regarding this.
In summary, this part of the code was already there; it is just moved up a bit.
We can work on refactoring this code in a subsequent PR.
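For reference, a minimal sketch of what isolating that timeout-refresh goroutine into a helper might look like in such a follow-up; the function name, signature, and refresh mechanism are hypothetical, not this PR's code:

```go
import (
	"context"
	"time"
)

// startKeepAliveRefresher runs the timeout-refresh loop in its own goroutine
// and reports the first refresh failure on the returned channel. Hypothetical
// shape only; the real refresh logic lives in the stream handler.
func startKeepAliveRefresher(ctx context.Context, interval time.Duration, refresh func() error) <-chan error {
	errCh := make(chan error, 1)
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				if err := refresh(); err != nil {
					errCh <- err
					return
				}
			}
		}
	}()
	return errCh
}
```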
subscriptionID := fmt.Sprintf("streaming-client-%s", streamingID)

initBurst := true
Could you add a comment to explain what initBurst is?
/merge
Devflow running:

/merge
Devflow running:
[CONTP-549] throttle the number of parallel client initial syncs in the tagger server (#31741)
What does this PR do?
This PR adds throttling behaviour to the tagger server so that the number of clients (remote taggers) undergoing the initial tagger sync in parallel is bounded by a configurable value.
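To make the mechanism concrete, here is a minimal sketch of the throttling idea, assuming a buffered channel used as a counting semaphore sized from the new remote_tagger.max_concurrent_sync setting; the type and function names are hypothetical and not taken from this PR's diff:

```go
package tagger

import "context"

// syncThrottler bounds how many remote tagger clients may run their
// initial sync at the same time. A buffered channel acts as a counting
// semaphore: each in-flight initial sync holds one token.
type syncThrottler struct {
	tokens chan struct{}
}

// newSyncThrottler returns a throttler allowing at most maxConcurrent
// initial syncs in parallel (e.g. the value of remote_tagger.max_concurrent_sync).
func newSyncThrottler(maxConcurrent int) *syncThrottler {
	return &syncThrottler{tokens: make(chan struct{}, maxConcurrent)}
}

// acquire blocks until a token is available or the stream's context is
// cancelled (e.g. the client disconnected while waiting for its turn).
func (t *syncThrottler) acquire(ctx context.Context) error {
	select {
	case t.tokens <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// release returns the token once the initial burst has been sent,
// letting the next waiting client start its sync.
func (t *syncThrottler) release() {
	<-t.tokens
}
```

Presumably the server acquires a token just before streaming a client's initial snapshot and releases it right after, so only the initial syncs are throttled; the exact wiring inside the tagger server may differ.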
Motivation
Flatten the memory and CPU usage curves on cluster agent startup to avoid OOMKills and container restarts on large clusters.
Such issues are visible on large clusters with many CLC runners.
Tested on large clusters, where the change showed a significant improvement by flattening the memory and CPU usage curves on startup of the DCA.
Describe how you validated your changes
The default behaviour should be preserved. E2E tests already cover CLC runners and cluster checks on CLC runners.
We can validate that throttling works by scaling up the number of CLC runner replicas, setting the DCA log level to DEBUG, and verifying in the DCA logs that no more than 4 CLC runner instances receive the initial tagger burst at the same time.
It is difficult to test this locally because the workload on a local machine is very small, so it is not easy to exhaust the throttler. What we can do is check that syncing works with a large number of CLC runners without issues.
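Beyond that manual check, the concurrency bound itself can be exercised with a unit-test-style sketch against the hypothetical syncThrottler shown earlier (illustrative only, not code from this PR):

```go
import (
	"context"
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

func TestSyncThrottlerBound(t *testing.T) {
	const limit = 4
	th := newSyncThrottler(limit)

	var inFlight, maxInFlight int32
	var wg sync.WaitGroup
	for i := 0; i < 20; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := th.acquire(context.Background()); err != nil {
				t.Error(err)
				return
			}
			cur := atomic.AddInt32(&inFlight, 1)
			// Track the highest concurrency observed so far.
			for {
				prev := atomic.LoadInt32(&maxInFlight)
				if cur <= prev || atomic.CompareAndSwapInt32(&maxInFlight, prev, cur) {
					break
				}
			}
			time.Sleep(10 * time.Millisecond) // simulate sending the initial burst
			atomic.AddInt32(&inFlight, -1)
			th.release()
		}()
	}
	wg.Wait()
	if got := atomic.LoadInt32(&maxInFlight); got > limit {
		t.Fatalf("observed %d concurrent syncs, want at most %d", got, limit)
	}
}
```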
Example:
By deploying this, you should be able to see in the DCA logs when the throttler token is requested and when the sync process has started/finished for each CLC runner. The logs should never show more than 4 CLC runners syncing in parallel.
Example from the logs showing multiple CLC runners requesting tokens and waiting until they get their turn: you can see that we never have more than 4 sync operations in progress at once.
At the end, all CLC runners should have their taggers synced with the cluster agent (you can check this with agent tagger-list).

Possible Drawbacks / Trade-offs
It slows down the initial sync of the tagger in CLC runners. However, this is acceptable because it only happens on very large clusters with many CLC runners, and it allows the DCA to survive the initialisation phase without excessive memory and CPU usage.
Additional Notes