[AMLII-1994] Improve expected_tags_duration for metrics #29634
Conversation
…iring host tags is properly calculated
Test changes on VM: Use this command from test-infra-definitions to manually test this PR's changes on a VM: inv create-vm --pipeline-id=46767706 --os-family=ubuntu. Note: This applies to commit 35be39e
Regression Detector Results
Run ID: 85573b84-e3c0-4a6a-a404-00503f3dda40
Baseline: 03a353c
Performance changes are noted in the perf column of each table:
No significant changes in experiment optimization goals
Confidence level: 90.00%
There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.
perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
---|---|---|---|---|---|---|
➖ | idle_all_features | memory utilization | +1.42 | [+1.32, +1.52] | 1 | Logs bounds checks dashboard |
➖ | tcp_syslog_to_blackhole | ingress throughput | +1.06 | [+0.97, +1.14] | 1 | Logs |
➖ | pycheck_lots_of_tags | % cpu utilization | +0.50 | [-2.08, +3.08] | 1 | Logs |
➖ | file_to_blackhole_500ms_latency | egress throughput | +0.08 | [-0.16, +0.33] | 1 | Logs |
➖ | file_tree | memory utilization | +0.07 | [-0.06, +0.20] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency | egress throughput | +0.01 | [-0.32, +0.34] | 1 | Logs |
➖ | file_to_blackhole_100ms_latency | egress throughput | +0.01 | [-0.21, +0.23] | 1 | Logs |
➖ | basic_py_check | % cpu utilization | +0.00 | [-2.97, +2.97] | 1 | Logs |
➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.01, +0.01] | 1 | Logs |
➖ | file_to_blackhole_300ms_latency | egress throughput | -0.00 | [-0.19, +0.18] | 1 | Logs |
➖ | uds_dogstatsd_to_api | ingress throughput | -0.01 | [-0.08, +0.07] | 1 | Logs |
➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | -0.10 | [-0.83, +0.64] | 1 | Logs |
➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.10 | [-0.59, +0.38] | 1 | Logs |
➖ | idle | memory utilization | -0.65 | [-0.70, -0.61] | 1 | Logs bounds checks dashboard |
➖ | otel_to_otel_logs | ingress throughput | -0.88 | [-1.69, -0.08] | 1 | Logs |
Bounds Checks
perf | experiment | bounds_check_name | replicates_passed |
---|---|---|---|
✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 |
✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 |
✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 |
✅ | file_to_blackhole_300ms_latency | memory_usage | 10/10 |
✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 |
✅ | idle | memory_usage | 10/10 |
✅ | idle_all_features | memory_usage | 10/10 |
Explanation
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true (a sketch of this decision rule follows the list):
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
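For readers who prefer code to prose, here is a minimal, self-contained Go sketch of that decision rule. The type and function names are illustrative only and are not taken from the Regression Detector's implementation; the thresholds match the ones stated above.

```go
package main

import "fmt"

// A sketch of the regression decision rule described above; names are
// illustrative and not taken from the Regression Detector's code.
type experiment struct {
	deltaMeanPct  float64 // estimated Δ mean %
	ciLow, ciHigh float64 // 90.00% confidence interval for Δ mean %
	markedErratic bool    // set in the experiment's configuration
}

func isRegression(e experiment) bool {
	bigEnough := e.deltaMeanPct >= 5.0 || e.deltaMeanPct <= -5.0 // |Δ mean %| ≥ 5.00%
	ciExcludesZero := e.ciLow > 0 || e.ciHigh < 0                // CI does not contain zero
	return bigEnough && ciExcludesZero && !e.markedErratic
}

func main() {
	// idle_all_features from the table above: +1.42 [+1.32, +1.52], not erratic.
	// Prints false: the CI excludes zero, but the effect is below the 5% threshold.
	fmt.Println(isRegression(experiment{deltaMeanPct: 1.42, ciLow: 1.32, ciHigh: 1.52}))
}
```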
Non-blocking suggestion! Approved for docs.
releasenotes/notes/fix-configurable-host-tags-c71fb5597d20bffe.yaml
NIT: Can we change the title to something like: "Improve …"
….yaml Co-authored-by: Esther Kim <[email protected]>
…datadog-agent into refactor-configurable-host-tags
This looks very promising and the numbers look great 👌
There are a few more files introduced in the old PR that you can delete things from:
This function and its usage:
func (c *WorkloadMetaCollector) handleHostTags(ev workloadmeta.Event) []*types.TagInfo {
This const and its usage:
KindHost Kind = "host"
Nice work!
@@ -115,6 +119,10 @@ func createIterableMetrics(
if isServerless {
	log.DebugfServerless("Sending sketches payload : %s", sketch.String())
}
hostTags := hostTagProvider.GetHostTags()
if hostTags != nil {
	sketch.Tags = tagset.CombineCompositeTagsAndSlice(sketch.Tags, hostTagProvider.GetHostTags())
Suggested change:
- sketch.Tags = tagset.CombineCompositeTagsAndSlice(sketch.Tags, hostTagProvider.GetHostTags())
+ sketch.Tags = tagset.CombineCompositeTagsAndSlice(sketch.Tags, hostTags)
NIT: to avoid calling GetHostTags twice. (Same for line 109 above.)
I have a question about the approach chosen to fix the reported issue: why don't we keep the information in workloadmeta? I feel like in some environments (hostless) it can still be interesting to have the logic, since the host-metadata payload is never sent.
Also, keeping host information inside workloadmeta could solve other issues in the future, like refreshing host tags faster when a new host tag appears after the agent has already sent the host metadata payload.
/lgtm for @DataDog/container-platform files
/merge
🚂 MergeQueue: pull request added to the queue.
What does this PR do?
Fixes configurable host tags that are added to metrics when expected_tags_duration is configured.
This PR changes the logic for adding host tags, moving it to after aggregation so that a metric's context hash does not change and the aggregator does not mistake the same metric for a new one.
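To illustrate why the move matters, here is a minimal, self-contained Go sketch of an aggregator-style context hash. It is not the agent's actual hashing code, and the metric and tag names (testcheck.metric, check:testcheck, foo:bar) are examples only; it just shows that a key computed without host tags stays stable when expected_tags_duration elapses, while a key computed with host tags folded in changes, so the series looks new.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// contextKey mimics the idea of an aggregator context hash: a key derived
// from the metric name and its tag set. Sketch only, not the agent's code.
func contextKey(name string, tags []string) uint64 {
	sorted := append([]string(nil), tags...)
	sort.Strings(sorted)
	h := fnv.New64a()
	h.Write([]byte(name))
	for _, t := range sorted {
		h.Write([]byte(t))
		h.Write([]byte{0})
	}
	return h.Sum64()
}

func main() {
	checkTags := []string{"check:testcheck"}
	hostTags := []string{"foo:bar"}

	// Host tags kept out of the hash (appended only at serialization time):
	// the metric keeps the same context key before and after
	// expected_tags_duration elapses.
	fmt.Println(contextKey("testcheck.metric", checkTags))

	// Host tags folded in before aggregation: the key differs, so when the
	// tags expire the aggregator sees what looks like a brand-new series.
	fmt.Println(contextKey("testcheck.metric", append(checkTags, hostTags...)))
}
```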
Motivation
Issue
Describe how to test/QA your changes
To verify that configurable host tags work as expected
foo:bar
To verify that the bug was resolved
dev/dist/checks.d/testcheck.py
1m
Here is how the spike used to look. QA steps are also given in the ticket.
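For the host-tag side of the QA, a minimal datadog.yaml sketch is below. The foo:bar tag and 1m duration mirror the values mentioned in the steps above; treat the exact settings in the ticket as authoritative.

```yaml
# Host tags expected to appear on metrics while expected_tags_duration is in effect.
tags:
  - foo:bar
# How long host tags are attached to metrics after the Agent starts.
expected_tags_duration: 1m
```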
Additional Notes
I've benchmarked these changes with dogstatsd stress tests and displayed the results in this document.
Notebook
RSS pod memory usage
Old host tag feature:
New host tag feature:
CPU usage
Old host tag feature
New host tag feature
Results:
3.7 % reduction in memory usage when tags are enabled while using the new host tag feature
1.8 % reduction in memory usage when tags are disabled while using the new host tag feature
1.7 % reduction in average cores used when tags are enabled while using the new host tag feature
3.5 % reduction in average cores used when tags are disabled while using the new host tag feature
The old version also has a CPU drop when the host tags expire, which the new version does not have.