[NDMII-3236] Update devicecheck profile when ProfileProvider has changed #32185

dplepage-dd · 2024-12-16T03:36:14Z

What does this PR do?

This PR adds a method to the ProfileProvider that reports when the profiles were last updated. The device check now caches the time at which it last gathered the metrics, tags, etc. from the matching profile, so that it can regenerate them if the ProfileProvider receives new profiles.
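To illustrate the timestamp-based invalidation described above, here is a minimal sketch. It is not the code in this PR: the `LastUpdated` and `GetProfile` method names, the `mapProvider` type, and the `rebuilds` counter are all assumptions made up for the example.

```go
package main

import (
	"fmt"
	"time"
)

// Profile is a simplified, hypothetical stand-in for the real profile type.
type Profile struct {
	Name    string
	Metrics []string
}

// ProfileProvider is a hypothetical interface; LastUpdated and GetProfile
// are assumed names, not necessarily the agent's actual API.
type ProfileProvider interface {
	LastUpdated() time.Time
	GetProfile(name string) (Profile, bool)
}

// mapProvider is a toy provider backed by a map.
type mapProvider struct {
	profiles map[string]Profile
	updated  time.Time
}

func (p *mapProvider) LastUpdated() time.Time { return p.updated }

func (p *mapProvider) GetProfile(name string) (Profile, bool) {
	prof, ok := p.profiles[name]
	return prof, ok
}

// DeviceCheck caches the profile it last built and the provider timestamp
// that was current when it built it.
type DeviceCheck struct {
	provider    ProfileProvider
	profileName string
	cached      Profile
	cachedAt    time.Time
	hasCache    bool
	rebuilds    int // counts regenerations, for illustration only
}

// currentProfile regenerates the cached profile only when nothing is cached
// yet or the provider reports an update newer than the cached timestamp.
func (d *DeviceCheck) currentProfile() Profile {
	updated := d.provider.LastUpdated()
	if !d.hasCache || updated.After(d.cachedAt) {
		prof, _ := d.provider.GetProfile(d.profileName)
		d.cached, d.cachedAt, d.hasCache = prof, updated, true
		d.rebuilds++
	}
	return d.cached
}

// runDemo returns the rebuild count before and after the provider changes,
// plus the first metric seen after the change.
func runDemo() (before, after int, metric string) {
	p := &mapProvider{
		profiles: map[string]Profile{"router": {Name: "router", Metrics: []string{"sysUpTime"}}},
		updated:  time.Unix(100, 0),
	}
	d := &DeviceCheck{provider: p, profileName: "router"}
	d.currentProfile()
	d.currentProfile() // provider unchanged, so the cache is reused
	before = d.rebuilds

	// Simulate the provider receiving new profiles.
	p.profiles["router"] = Profile{Name: "router", Metrics: []string{"ifInOctets"}}
	p.updated = time.Unix(200, 0)
	prof := d.currentProfile()
	return before, d.rebuilds, prof.Metrics[0]
}

func main() {
	before, after, metric := runDemo()
	fmt.Println(before, after, metric)
}
```

Comparing timestamps keeps the check cheap on every run: the profile is rebuilt only when the provider says something changed.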

This also moves the cached values into the DeviceCheck instead of storing them on the CheckConfig. Since the values being cached are a set of metrics, a set of SNMP tags, a set of static tags, and a Metadata object, we now store them in a Profile instead of as separate fields.

Motivation

https://datadoghq.atlassian.net/browse/NDMII-3236

Describe how you validated your changes

Unit tests pass; I haven't QA'd it locally yet but will do so on Monday 12/16.

Additional Notes

Although this adds support for the set of profiles changing, no ProfileProvider exists yet that actually changes them over time, so the behavior of the agent should be unchanged, except in one very specific case. If an integration is configured with an explicit profile name and that profile doesn't exist, the current behavior is to error out. After this PR, the integration will log a warning but will not error out, and any metrics or tags explicitly present in the initConfig will be monitored even though the profile could not be found. If the profile becomes available later (currently impossible), the check will begin working again.
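The behavior change for a missing explicit profile could look roughly like the sketch below. The `buildProfile` function, its parameters, and the simplified `Profile` type are hypothetical and exist only to show the warn-and-continue fallback; the real code is structured differently.

```go
package main

import (
	"fmt"
	"log"
)

// Profile is a simplified, hypothetical stand-in for the real profile type.
type Profile struct {
	Name    string
	Metrics []string
}

// buildProfile starts from the metrics configured inline and adds the named
// profile's metrics if that profile exists. A missing profile produces a
// warning instead of an error, so the inline metrics are still monitored.
func buildProfile(inlineMetrics []string, profileName string, available map[string]Profile) Profile {
	result := Profile{Metrics: append([]string(nil), inlineMetrics...)}
	if profileName == "" {
		return result
	}
	prof, ok := available[profileName]
	if !ok {
		log.Printf("warning: profile %q not found; using only explicitly configured metrics", profileName)
		return result
	}
	result.Name = prof.Name
	result.Metrics = append(result.Metrics, prof.Metrics...)
	return result
}

func main() {
	available := map[string]Profile{
		"router": {Name: "router", Metrics: []string{"ifInOctets"}},
	}
	// The requested profile does not exist, so only the inline metric remains.
	fmt.Println(buildProfile([]string{"sysUpTime"}, "missing", available).Metrics)
	// The requested profile exists, so its metrics are appended.
	fmt.Println(buildProfile([]string{"sysUpTime"}, "router", available).Metrics)
}
```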

The .Metrics, .MetricTags, etc. fields on CheckConfig store the same attributes as a profile, so it is easier to create a Profile from the CheckConfig and pass that around. This also removes the caching of those fields from the CheckConfig; a later commit caches this generated profile in the DeviceCheck.
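As a rough sketch of that refactor, the example below builds a single Profile value from a config and hands it to a reporting function. The field names, the `profileFromConfig` and `reportSummary` functions, and the use of plain strings for metrics and tags are simplifying assumptions, not the agent's actual definitions.

```go
package main

import "fmt"

// Profile groups everything the check needs to collect and report.
// All fields here are simplified, hypothetical versions of the real ones.
type Profile struct {
	Metrics    []string
	MetricTags []string
	StaticTags []string
	Metadata   map[string]string
}

// CheckConfig carries the same attributes alongside connection settings.
type CheckConfig struct {
	IPAddress  string
	Metrics    []string
	MetricTags []string
	StaticTags []string
	Metadata   map[string]string
}

// profileFromConfig copies the profile-like fields out of the config so that
// later code can work with one Profile value instead of separate fields.
func profileFromConfig(cfg CheckConfig) Profile {
	meta := make(map[string]string, len(cfg.Metadata))
	for k, v := range cfg.Metadata {
		meta[k] = v
	}
	return Profile{
		Metrics:    append([]string(nil), cfg.Metrics...),
		MetricTags: append([]string(nil), cfg.MetricTags...),
		StaticTags: append([]string(nil), cfg.StaticTags...),
		Metadata:   meta,
	}
}

// reportSummary accepts a Profile rather than reading fields off the config.
func reportSummary(p Profile) string {
	return fmt.Sprintf("%d metrics, %d metric tags, %d static tags",
		len(p.Metrics), len(p.MetricTags), len(p.StaticTags))
}

func main() {
	cfg := CheckConfig{
		IPAddress:  "192.0.2.1",
		Metrics:    []string{"sysUpTime"},
		StaticTags: []string{"site:lab"},
	}
	fmt.Println(reportSummary(profileFromConfig(cfg)))
}
```

Copying the slices and map means the generated Profile can be cached and replaced without mutating the config it came from.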

agent-platform-auto-pr · 2024-12-16T04:17:53Z

Uncompressed package size comparison

Comparison with ancestor 786bce8532868cc9f21e4f45233c4c517d36bdd9

Diff per package

package	diff	status	size	ancestor	threshold
datadog-agent-arm64-deb	0.02MB	⚠️	1003.96MB	1003.94MB	140.00MB
datadog-agent-aarch64-rpm	0.02MB	⚠️	1013.17MB	1013.15MB	140.00MB
datadog-heroku-agent-amd64-deb	0.01MB	⚠️	505.44MB	505.42MB	70.00MB
datadog-iot-agent-x86_64-rpm	0.01MB	⚠️	113.36MB	113.35MB	10.00MB
datadog-iot-agent-x86_64-suse	0.01MB	⚠️	113.36MB	113.35MB	10.00MB
datadog-iot-agent-amd64-deb	0.01MB	⚠️	113.29MB	113.28MB	10.00MB
datadog-agent-x86_64-rpm	0.00MB	⚠️	1278.10MB	1278.10MB	140.00MB
datadog-agent-x86_64-suse	0.00MB	⚠️	1278.10MB	1278.10MB	140.00MB
datadog-agent-amd64-deb	0.00MB	⚠️	1268.87MB	1268.87MB	140.00MB
datadog-iot-agent-arm64-deb	0.00MB	⚠️	108.76MB	108.76MB	10.00MB
datadog-iot-agent-aarch64-rpm	0.00MB	⚠️	108.83MB	108.83MB	10.00MB
datadog-dogstatsd-amd64-deb	0.00MB	✅	78.52MB	78.52MB	10.00MB
datadog-dogstatsd-x86_64-rpm	0.00MB	✅	78.59MB	78.59MB	10.00MB
datadog-dogstatsd-x86_64-suse	0.00MB	✅	78.59MB	78.59MB	10.00MB
datadog-dogstatsd-arm64-deb	0.00MB	✅	55.74MB	55.74MB	10.00MB

Decision

⚠️ Warning

agent-platform-auto-pr · 2024-12-16T04:18:34Z

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv aws.create-vm --pipeline-id=51169033 --os-family=ubuntu

Note: This applies to commit d43ce0b

cit-pr-commenter · 2024-12-16T04:41:04Z

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 550be4ec-c872-457f-86c2-4071ecc4106b

Baseline: 786bce8
Comparison: 6aa91c6
Diff

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	quality_gate_idle_all_features	memory utilization	+2.22	[+2.11, +2.33]	1	Logs bounds checks dashboard
➖	otel_to_otel_logs	ingress throughput	+1.02	[+0.35, +1.69]	1	Logs
➖	file_tree	memory utilization	+0.65	[+0.54, +0.77]	1	Logs
➖	uds_dogstatsd_to_api_cpu	% cpu utilization	+0.31	[-0.42, +1.05]	1	Logs
➖	quality_gate_logs	% cpu utilization	+0.11	[-2.84, +3.06]	1	Logs
➖	file_to_blackhole_1000ms_latency_linear_load	egress throughput	+0.05	[-0.41, +0.51]	1	Logs
➖	file_to_blackhole_500ms_latency	egress throughput	+0.04	[-0.73, +0.81]	1	Logs
➖	file_to_blackhole_300ms_latency	egress throughput	+0.04	[-0.59, +0.67]	1	Logs
➖	file_to_blackhole_0ms_latency_http2	egress throughput	+0.03	[-0.82, +0.87]	1	Logs
➖	file_to_blackhole_0ms_latency_http1	egress throughput	+0.01	[-0.81, +0.83]	1	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	-0.00	[-0.01, +0.01]	1	Logs
➖	uds_dogstatsd_to_api	ingress throughput	-0.00	[-0.10, +0.09]	1	Logs
➖	file_to_blackhole_0ms_latency	egress throughput	-0.02	[-0.81, +0.76]	1	Logs
➖	file_to_blackhole_100ms_latency	egress throughput	-0.03	[-0.78, +0.72]	1	Logs
➖	tcp_syslog_to_blackhole	ingress throughput	-0.17	[-0.23, -0.12]	1	Logs
➖	file_to_blackhole_1000ms_latency	egress throughput	-0.29	[-1.08, +0.51]	1	Logs
➖	quality_gate_idle	memory utilization	-0.49	[-0.53, -0.44]	1	Logs bounds checks dashboard

Bounds Checks: ❌ Failed

perf	experiment	bounds_check_name	replicates_passed	links
❌	file_to_blackhole_0ms_latency_http2	lost_bytes	9/10
✅	file_to_blackhole_0ms_latency	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency	memory_usage	10/10
✅	file_to_blackhole_0ms_latency_http1	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency_http1	memory_usage	10/10
✅	file_to_blackhole_0ms_latency_http2	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency_linear_load	memory_usage	10/10
✅	file_to_blackhole_100ms_latency	lost_bytes	10/10
✅	file_to_blackhole_100ms_latency	memory_usage	10/10
✅	file_to_blackhole_300ms_latency	lost_bytes	10/10
✅	file_to_blackhole_300ms_latency	memory_usage	10/10
✅	file_to_blackhole_500ms_latency	lost_bytes	10/10
✅	file_to_blackhole_500ms_latency	memory_usage	10/10
✅	quality_gate_idle	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_idle_all_features	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_logs	lost_bytes	10/10
✅	quality_gate_logs	memory_usage	10/10

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".
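The three criteria above can be expressed as a small predicate. This is only an illustrative sketch: the `Experiment` type and `isRegression` function are hypothetical and are not taken from the Regression Detector's implementation.

```go
package main

import (
	"fmt"
	"math"
)

// Experiment holds the values reported per row in the table above.
// The type and its field names are hypothetical.
type Experiment struct {
	DeltaMeanPct float64 // estimated Δ mean %
	CILow        float64 // lower bound of the Δ mean % CI
	CIHigh       float64 // upper bound of the Δ mean % CI
	Erratic      bool    // whether the configuration marks it "erratic"
}

// isRegression applies the three criteria: the effect is at least as large
// as the tolerance, the confidence interval excludes zero, and the
// experiment is not marked erratic.
func isRegression(e Experiment, tolerancePct float64) bool {
	bigEnough := math.Abs(e.DeltaMeanPct) >= tolerancePct
	excludesZero := e.CILow > 0 || e.CIHigh < 0
	return bigEnough && excludesZero && !e.Erratic
}

func main() {
	// quality_gate_idle_all_features from the table: the interval excludes
	// zero, but the effect is below the 5.00% tolerance.
	idle := Experiment{DeltaMeanPct: 2.22, CILow: 2.11, CIHigh: 2.33}
	fmt.Println(isRegression(idle, 5.0))

	// A made-up experiment that satisfies all three criteria.
	madeUp := Experiment{DeltaMeanPct: 6.0, CILow: 5.5, CIHigh: 6.5}
	fmt.Println(isRegression(madeUp, 5.0))
}
```

This is why a row such as quality_gate_idle_all_features, whose interval does not contain zero, is still reported as no significant change: its |Δ mean %| is under the 5.00% tolerance.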

CI Pass/Fail Decision

✅ Passed. All Quality Gates passed.

quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.

dplepage-dd · 2024-12-18T23:47:41Z

Closing this to reopen based on a different branch (I'm learning how to use git machete to manage stacked PRs properly)

dplepage-dd added 12 commits December 14, 2024 15:53

Don't convert []string to map[string]string and back.

0fdf688

Add helper to get static vendor from a profile.

cb18d24

Have ProfileProvider report last modification time.

546a99a

Add profile names to profiles when loading from yaml.

629c182

Extract scalar/column OIDs from profiles.

8319406

Populate names on initconfig profiles.

10478b5

Make error more readable when no profiles match a sysobjectid.

6aa91c6

Only pass fields we need into Fetch.

66ccd37

Accept profile in report methods.

15e9b64

Cache the latest profile in devicecheck.

f26b20c

Provide empty ProfileProviders in discovery tests.

c602b9c

dplepage-dd added changelog/no-changelog team/network-device-monitoring team/ndm-core labels Dec 16, 2024

dplepage-dd requested a review from a team as a code owner December 16, 2024 03:36

github-actions bot added the long review (PR is complex, plan time to review it) label Dec 16, 2024

dplepage-dd added 2 commits December 15, 2024 22:47

Remove debug printlns

cc18536

Spacing fix.

d43ce0b

dplepage-dd closed this Dec 18, 2024

dplepage-dd deleted the dpl/cache-in-devicecheck branch December 18, 2024 23:47
