[NPM-3571] Include system-probe telemetry in agent flare #29768

pimlu · 2024-10-03T14:32:58Z

What does this PR do?

This PR adds the /telemetry endpoint from the system-probe unix socket server into the agent flare.

Motivation

This is part of an ongoing effort to add more information into the agent flare for NPM to reduce the round trips and TTR for support issues.

Describe how to test/QA your changes

While system-probe is running, you can view the /telemetry endpoint's data with one of these two commands (the path varies depending on config):

sudo curl --unix-socket /var/run/sysprobe.sock http://unix/telemetry
sudo curl --unix-socket /opt/datadog-agent/run/sysprobe.sock http://unix/telemetry

That output should match the contents of system_probe_telemetry.txt, which you get by running datadog-agent flare and unzipping the resulting archive.

Possible Drawbacks / Trade-offs

Additional Notes

The file name system_probe_telemetry.txt is arbitrary, and we can change it.

cit-pr-commenter · 2024-10-03T14:36:46Z

Go Package Import Differences

Baseline: 4c69af7
Comparison: 7dc7b6c

binary	os	arch	change
agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/flare/sysprobe
agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/flare/sysprobe
agent	windows	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/flare/sysprobe
agent	darwin	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/flare/sysprobe
agent	darwin	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/flare/sysprobe
iot-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/flare/sysprobe
iot-agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/flare/sysprobe
heroku-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/flare/sysprobe
cluster-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/flare/sysprobe
cluster-agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/flare/sysprobe
cluster-agent-cloudfoundry	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/flare/sysprobe
cluster-agent-cloudfoundry	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/flare/sysprobe
security-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/flare/sysprobe
security-agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/flare/sysprobe

pr-commenter · 2024-10-03T15:13:17Z

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv create-vm --pipeline-id=46178513 --os-family=ubuntu

Note: This applies to commit 7dc7b6c

pr-commenter · 2024-10-03T15:52:01Z

Regression Detector

Regression Detector Results

Run ID: 9f28e9b-70dd-49c7-8f20-e154188227ea

Baseline: 4c69af7
Comparison: 7dc7b6c

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	idle_all_features	memory utilization	+0.71	[+0.62, +0.80]	1	Logs
➖	uds_dogstatsd_to_api_cpu	% cpu utilization	+0.56	[-0.18, +1.29]	1	Logs
➖	tcp_syslog_to_blackhole	ingress throughput	+0.09	[+0.04, +0.13]	1	Logs
➖	uds_dogstatsd_to_api	ingress throughput	+0.01	[-0.10, +0.12]	1	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	-0.00	[-0.01, +0.01]	1	Logs
➖	file_tree	memory utilization	-0.21	[-0.32, -0.10]	1	Logs
➖	idle	memory utilization	-0.37	[-0.41, -0.32]	1	Logs
➖	basic_py_check	% cpu utilization	-0.54	[-3.29, +2.21]	1	Logs
➖	pycheck_lots_of_tags	% cpu utilization	-1.43	[-3.91, +1.04]	1	Logs
➖	otel_to_otel_logs	ingress throughput	-1.88	[-2.69, -1.08]	1	Logs

Bounds Checks

perf	experiment	bounds_check_name	replicates_passed
✅	idle	memory_usage	10/10

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".
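The three criteria above can be read as a single boolean check. The Go sketch below is only an illustration of that rule as worded in this comment, not the Regression Detector's actual implementation; the experiment type, its field names, and the isRegression function are hypothetical. The 5.00% tolerance is taken from the report above.

```go
package main

import (
	"fmt"
	"math"
)

// experiment holds one row of the "fine details" table.
// (Hypothetical type; field names are illustrative.)
type experiment struct {
	Name         string
	DeltaMeanPct float64 // estimated Δ mean %
	CILow        float64 // lower bound of the Δ mean % CI
	CIHigh       float64 // upper bound of the Δ mean % CI
	Erratic      bool    // true if the configuration marks it "erratic"
}

// effectSizeTolerance is the |Δ mean %| threshold quoted in the report.
const effectSizeTolerance = 5.0

// isRegression applies the three criteria: the effect is large enough,
// the confidence interval excludes zero, and the experiment is not erratic.
func isRegression(e experiment) bool {
	bigEnough := math.Abs(e.DeltaMeanPct) >= effectSizeTolerance
	excludesZero := e.CILow > 0 || e.CIHigh < 0
	return bigEnough && excludesZero && !e.Erratic
}

func main() {
	// Two rows copied from the table above; none are marked erratic there.
	rows := []experiment{
		{Name: "otel_to_otel_logs", DeltaMeanPct: -1.88, CILow: -2.69, CIHigh: -1.08},
		{Name: "basic_py_check", DeltaMeanPct: -0.54, CILow: -3.29, CIHigh: 2.21},
	}
	for _, r := range rows {
		fmt.Printf("%s: regression=%v\n", r.Name, isRegression(r))
	}
}
```

This also shows why otel_to_otel_logs is not flagged even though its interval excludes zero: its estimated change of -1.88% is below the 5.00% tolerance.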

pkg/flare/archive.go

akarpz · 2024-10-03T19:28:58Z

out of curiosity, how does the data look once it's dumped into the text file? is it easy to inspect?

pimlu · 2024-10-03T19:53:44Z

It seems pretty readable. It looks like this:

<...>
network_tracer__closed_conns{ip_proto="TCP"} 36
network_tracer__dns__decoding_errors 6
network_tracer__dns_cache__size 0
network_tracer__ebpf__closed_conn_polling_received 36
network_tracer__ebpf__double_flush_attempts_close 25
network_tracer__ebpf__double_flush_attempts_done 0
network_tracer__ebpf__failed_conn_polling_lost 47
<...>
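If someone wants to inspect the dump programmatically, a small parser is enough for lines shaped like the excerpt above. The Go sketch below assumes each data line is a metric name, an optional {label} block, a space, and a numeric value, with no trailing timestamp; that is an assumption based only on this excerpt. The sample type and parseLine function are hypothetical names, not agent APIs.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sample is one parsed telemetry line. (Hypothetical type.)
type sample struct {
	Name   string  // metric name without labels
	Labels string  // raw text between { and }, empty if there are no labels
	Value  float64 // numeric value at the end of the line
}

// parseLine parses a single line of the dump. It returns false for blank
// lines, comment lines, and anything that does not end in a number.
// Assumption: the value is the last space-separated field.
func parseLine(line string) (sample, bool) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return sample{}, false
	}
	idx := strings.LastIndexByte(line, ' ')
	if idx < 0 {
		return sample{}, false
	}
	v, err := strconv.ParseFloat(line[idx+1:], 64)
	if err != nil {
		return sample{}, false
	}
	series := strings.TrimSpace(line[:idx])
	s := sample{Name: series, Value: v}
	if open := strings.IndexByte(series, '{'); open >= 0 && strings.HasSuffix(series, "}") {
		s.Name = series[:open]
		s.Labels = series[open+1 : len(series)-1]
	}
	return s, true
}

// excerpt reuses a few lines from the comment above.
const excerpt = `network_tracer__closed_conns{ip_proto="TCP"} 36
network_tracer__dns__decoding_errors 6
network_tracer__dns_cache__size 0
network_tracer__ebpf__failed_conn_polling_lost 47
`

func main() {
	sc := bufio.NewScanner(strings.NewReader(excerpt))
	for sc.Scan() {
		if s, ok := parseLine(sc.Text()); ok && s.Value != 0 {
			// Print only non-zero metrics, which is often what matters in triage.
			fmt.Printf("%s [%s] = %g\n", s.Name, s.Labels, s.Value)
		}
	}
}
```

For anything beyond quick triage, a full parser for the exposition format would be a safer choice than this hand-rolled split.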

comp/core/flare/builder/builder.go

GustavoCaso · 2024-10-07T09:33:12Z

@pimlu which Agent binary is intended to collect the system-probe telemetry data: the main agent or the security-agent?

I ask mostly to understand whether the logic should be placed in pkg/flare/archive.go, which is used by every Agent that implements the flare command, or whether we should isolate it to the security-agent.

Also, does the number of imports added by this change make sense?

Maybe we need to isolate some functions in individual packages to avoid that huge increase in imported packages.

pkg/flare/processflare/archive_process.go

pimlu · 2024-10-08T17:53:15Z

@GustavoCaso I have updated this PR; it no longer pulls in so many imports.

GustavoCaso

Great work 🎉

pimlu · 2024-10-09T16:05:02Z

/merge

dd-devflow · 2024-10-09T16:05:15Z

🚂 MergeQueue: pull request added to the queue

The median merge time in main is 26m.

Use /merge -c to cancel this operation!

robertjli · 2024-10-30T21:10:11Z

Removing team/processes, but please let me know if we do actually need a QA card for this

github-actions bot added team/processes team/networks labels Oct 3, 2024

pimlu marked this pull request as ready for review October 3, 2024 19:20

pimlu requested review from a team as code owners October 3, 2024 19:20

pimlu requested a review from a team October 3, 2024 19:20

pimlu requested review from a team as code owners October 3, 2024 19:20

pimlu requested a review from akarpz October 3, 2024 19:20

jhgilbert approved these changes Oct 3, 2024


akarpz reviewed Oct 3, 2024


pkg/flare/archive.go (outdated)

hmahmood approved these changes Oct 4, 2024


GustavoCaso reviewed Oct 7, 2024


comp/core/flare/builder/builder.go (outdated)

pimlu added 10 commits October 8, 2024 07:24

[NPM-3571] Include system-probe telemetry in agent flare

384d0e6

add to interface

f6650e6

lint

1f7d4a0

lint

0af1fb6

update sysprobeutil mockery

c3d9094

add windows constant, change to system_probe_telemetry.txt

8f21678

add release notes

5cc1b73

fix windows

873f8c1

rename file

94f8d85

partial feedback

4b994fe

hopefully fix agent imports

77da77d

pimlu force-pushed the stuart.geipel/agent-flare-telemetry branch from b8b30d4 to 77da77d Compare October 8, 2024 14:35

pimlu added 2 commits October 8, 2024 07:57

unused variable

1d0009a

add package comment

d62bca8

hmahmood reviewed Oct 8, 2024


pkg/flare/processflare/archive_process.go (outdated)

feedback

6378cf7

add comment

91019d5

clarkb7 approved these changes Oct 8, 2024


update docs

7dc7b6c

pimlu requested a review from GustavoCaso October 9, 2024 14:38

GustavoCaso approved these changes Oct 9, 2024


dd-mergequeue bot merged commit cb7a94e into main Oct 9, 2024
210 checks passed

dd-mergequeue bot deleted the stuart.geipel/agent-flare-telemetry branch October 9, 2024 16:29

github-actions bot added this to the 7.60.0 milestone Oct 9, 2024

pimlu mentioned this pull request Oct 24, 2024

[NPM-3663] Include BTF availability in agent flare #30440

Merged

robertjli removed the team/processes label Oct 30, 2024
