Add agent diagnostics action #1703

michel-laterman · 2022-11-09T22:59:29Z

What does this PR do?

Add the ability for the agent to receive a diagnostics action. This action makes the agent take a diagnostics bundle and upload it to a new fleet-server file upload API. Changed internal details of how diagnostics bundles are taken to better support this:

Move the archive (zip) assembly from cmd/diagnostics.go into the diagnostics module so that running diagnostics as a command and as an action will produce the same binary.
log files are collected during zip assembly (unchanged, but moved)
diagnostics handler attempts to create the bundle in a temp file and falls back to in-memory buffer in case of an issue
Upload client has a built-in retry mechanism that will resend requests if a 429 is received.
Change how action ackevents are created (added a method to the action interface) in order to simplify the acker implementation

Why is it important?

In order to make it easier for customers to debug deployments we are adding the ability to gather diagnostics bundles in the fleet UI in Kibana.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Run ES snapshot and Kibana with changes present in elastic/kibana#149575

Run a SNAPSHOT build of fleet-server (that has the file upload API)

Enroll a new agent in fleet.

Go to the diagnostics tab and request a new diagnostics bundle.

Related issues

Closes [Fleet] Support new "request diagnostics" action type fleet-server#1883

Add the ability for the agent to recieve a diagnostics action. This action makes the agent take a diagnostics bundle and upload it to a new fleet-server file upload API. Changed internal details of how diagnostics bundles are taken to better support this: Hooks are provided through a callback func so that log files generated as the agent is running can be included in the bundle. Change the Hook func signature from returning a []byte to a []byte, time.Time so that the file mod times collected through diagnostics hooks can be returned from the same filesystem interaction. Upload client has a built-in retry mechanism that will resend requests if a 429 is recieved.

mergify · 2022-11-09T23:00:06Z

This pull request does not have a backport label. Could you fix it @michel-laterman? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

michel-laterman · 2022-11-09T23:02:42Z

Recreate #1631 now that v2 has been merged.

~~TODO:~~

~~Cleanup uploader once fleet-api is done: File Upload Feature fleet-server#1902~~
~~fleet-server needs to handle acks correctly~~

elasticmachine · 2022-11-09T23:13:04Z

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS

Expand to view the summary

Build stats

Start Time: 2023-01-31T18:31:16.272+0000
Duration: 16 min 48 sec

Test stats 🧪

Test	Results
Failed	0
Passed	4889
Skipped	13
Total	4902

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate the packages.
run integration tests : Run the Elastic Agent Integration tests.
run end-to-end tests : Generate the packages and run the E2E Tests.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

elasticmachine · 2022-11-16T23:44:15Z

🌐 Coverage report

Name	Metrics % (`covered/total`)	Diff
Packages	98.333% (`59/60`)	👍 0.028
Files	69.082% (`143/207`)	👎 -0.186
Classes	68.514% (`272/397`)	👎 -0.717
Methods	53.836% (`835/1551`)	👎 -0.211
Lines	39.188% (`9097/23214`)	👎 -0.047
Conditionals	100.0% (`0/0`)	💚

Use the control.Client interface instead of a reference to the coordinator so that less boilerplate code is used. Fix callback setups. Cleanup uploader structs as the API has simplified a little.

Change the construction of AckEvent from being an internal method in the fleet.Acker to a method all Actions implement in order to reduce the acker complexity and increase the clarity of what is sent when an action is acked at the cost some code duplication.

internal/pkg/fleetapi/action.go

mergify · 2022-11-28T07:28:40Z

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b diagnostics-action upstream/diagnostics-action
git merge upstream/main
git push upstream diagnostics-action

michel-laterman · 2023-01-30T15:13:07Z

@juliaElastic, added a call to commit the ack, that should remove the delay

juliaElastic · 2023-01-30T15:52:55Z

added a call to commit the ack, that should remove the delay

@michel-laterman I tested with the change, but I don't see any change in the delay. Does this work for you locally?

## Summary Changed query of diagnostics files to speed up seeing the files. This is because the agent has a delay of about 4m to ack the action, this has to be fixed separately, see here elastic/elastic-agent#1703 (comment) Related to #141074 We can search for the diagnostics file by `agent_id` and `action_id`, so don't have to wait for the `upload_id` which comes from `.fleet-actions-results`. https://user-images.githubusercontent.com/90178898/215451881-bfaa9e86-e055-4490-87b1-dc1d1076a738.mov Displaying error from agent when diagnostics failed: <img width="839" alt="image" src="https://user-images.githubusercontent.com/90178898/215476207-5db7e935-28dd-432e-a6a6-195da162028a.png"> E.g. `.fleet-files-agent` ``` { "_index": ".fleet-files-agent-000001", "_id": "8a004559-0731-4b8f-b29e-d7405ca0d68c.3a1f21b3-4559-4d3f-aae0-58356c269a92", "_score": null, "_source": { "action_id": "8a004559-0731-4b8f-b29e-d7405ca0d68c", "agent_id": "3a1f21b3-4559-4d3f-aae0-58356c269a92", "contents": null, "file": { "ChunkSize": 4194304, "Status": "READY", "ext": "zip", "hash": { "md5": "", "sha256": "" }, "mime_type": "application/zip", "name": "elastic-agent-diagnostics-2023-01-30T10-13-33Z-00.zip", "size": 577178 }, "src": "agent", "upload_id": "988da8ad-9d92-4d18-b5b0-b2a7e77f5a81", "upload_start": 1675073615066, "transithash": { "sha256": "8a417cc8a73e32723ff449b603412113f319c7447044e81acab3f57d4e8226c8" } }, ``` Changed the style to be more consistent: <img width="898" alt="image" src="https://user-images.githubusercontent.com/90178898/215492173-7362fab7-15e6-4de9-824b-239164512231.png"> ### Checklist - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

Diagnostics handler will write the bundle to a temp file on disk before reading chunks out to requests. This is done in an attempt to avoid loading a large file in memory. If the file creation on disk fails an in-memory buffer is used instead.

michel-laterman · 2023-01-31T01:36:09Z

internal/pkg/agent/application/actions/handlers/handler_action_diagnostics.go

+	// attempt to create the a temporary diagnostics file on disk in order to avoid loading a
+	// potentially large file in memory.
+	// if on-disk creation fails an in-memory buffer is used.
+	f, s, err := h.diagFile(aDiag, uDiag)


Using Amit's suggestion here to use an on-disk temp file and fall a back to an in-memory buffer.

Awesome work to turn around on that feedback so quickly. Really appreciate it.

cmacknz · 2023-01-31T03:00:41Z

OK, figured out why it takes a while for the diagnostics to show; tldr the ack is delayed (I believe due to batching)
Here's the fleet-server logs

Probably this is the same problem as #2410

blakerouse

Looks really good. I do think changing this to async should be done before merging.

internal/pkg/agent/application/actions/handlers/handler_action_diagnostics.go

blakerouse

Change looks good. Got a typo that needs to be fixed.

internal/pkg/agent/application/actions/handlers/handler_action_diagnostics.go

…_diagnostics.go Co-authored-by: Blake Rouse <[email protected]>

michel-laterman · 2023-01-31T18:01:37Z

/test

michel-laterman · 2023-01-31T18:31:02Z

/test

michel-laterman · 2023-01-31T18:56:04Z

I'll follow up with a docs pr

… to use upload_id (elastic#149575) ## Summary Closes elastic#141074 Enabled feature flag and tweaked implementation to find file by `upload_id` rather than doc id. How to test: - Start local kibana, start Fleet Server, enroll Elastic Agent from local (pull [these changes](elastic/elastic-agent#1703) ) - Click on Request Diagnostics action on the Agent - The diagnostics file should appear on Agent Details / Diagnostics tab. - The action should be completed on Agent activity <img width="1585" alt="image" src="https://user-images.githubusercontent.com/90178898/214805187-2b1abe34-ba7e-4612-9fad-7ef1f5942f47.png"> <img width="745" alt="image" src="https://user-images.githubusercontent.com/90178898/214805997-20fdaa01-e4c5-461c-b395-1b1e43117f8a.png"> The file metadata and binary can be queried from these indices: ``` GET .fleet-files-agent/_search GET .fleet-file-data-agent/_search ``` Tweaked the implementation so that the pending actions are showing up as soon as the `.fleet-actions` record is created (it can take several minutes until the action result is ready) Plus added a tooltip for error status <img width="948" alt="image" src="https://user-images.githubusercontent.com/90178898/214841337-eacbb1fc-4934-4d8b-9d52-8db4502d2493.png"> ### Checklist - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

## Summary Changed query of diagnostics files to speed up seeing the files. This is because the agent has a delay of about 4m to ack the action, this has to be fixed separately, see here elastic/elastic-agent#1703 (comment) Related to elastic#141074 We can search for the diagnostics file by `agent_id` and `action_id`, so don't have to wait for the `upload_id` which comes from `.fleet-actions-results`. https://user-images.githubusercontent.com/90178898/215451881-bfaa9e86-e055-4490-87b1-dc1d1076a738.mov Displaying error from agent when diagnostics failed: <img width="839" alt="image" src="https://user-images.githubusercontent.com/90178898/215476207-5db7e935-28dd-432e-a6a6-195da162028a.png"> E.g. `.fleet-files-agent` ``` { "_index": ".fleet-files-agent-000001", "_id": "8a004559-0731-4b8f-b29e-d7405ca0d68c.3a1f21b3-4559-4d3f-aae0-58356c269a92", "_score": null, "_source": { "action_id": "8a004559-0731-4b8f-b29e-d7405ca0d68c", "agent_id": "3a1f21b3-4559-4d3f-aae0-58356c269a92", "contents": null, "file": { "ChunkSize": 4194304, "Status": "READY", "ext": "zip", "hash": { "md5": "", "sha256": "" }, "mime_type": "application/zip", "name": "elastic-agent-diagnostics-2023-01-30T10-13-33Z-00.zip", "size": 577178 }, "src": "agent", "upload_id": "988da8ad-9d92-4d18-b5b0-b2a7e77f5a81", "upload_start": 1675073615066, "transithash": { "sha256": "8a417cc8a73e32723ff449b603412113f319c7447044e81acab3f57d4e8226c8" } }, ``` Changed the style to be more consistent: <img width="898" alt="image" src="https://user-images.githubusercontent.com/90178898/215492173-7362fab7-15e6-4de9-824b-239164512231.png"> ### Checklist - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

michel-laterman added enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team Team:Fleet Label for the Fleet team labels Nov 9, 2022

mergify bot assigned michel-laterman Nov 9, 2022

mergify bot added the backport-skip label Nov 9, 2022

Merge remote-tracking branch 'origin/main' into diagnostics-action

b70f7c3

michel-laterman force-pushed the diagnostics-action branch 5 times, most recently from 97bff62 to 394db46 Compare November 16, 2022 23:43

michel-laterman force-pushed the diagnostics-action branch from 394db46 to e3ecfab Compare November 17, 2022 17:26

Fix PR and use control client instead of coord in handler

e9b59a2

Use the control.Client interface instead of a reference to the coordinator so that less boilerplate code is used. Fix callback setups. Cleanup uploader structs as the API has simplified a little.

michel-laterman force-pushed the diagnostics-action branch from e3ecfab to e9b59a2 Compare November 17, 2022 17:45

michel-laterman added 2 commits November 17, 2022 13:07

Fix linter and tests

ef7038a

michel-laterman commented Nov 18, 2022

View reviewed changes

internal/pkg/fleetapi/action.go Show resolved Hide resolved

Fix linter

5e2792a

michel-laterman added 4 commits December 12, 2022 08:04

Merge branch 'main' into diagnostics-action

8fa1bdc

Fix merge

267cb6c

Merge remote-tracking branch 'origin/main' into diagnostics-action

ee1f787

Fix tests

3a949f3

michel-laterman marked this pull request as ready for review December 12, 2022 19:42

michel-laterman requested a review from a team as a code owner December 12, 2022 19:42

fixed storing action_id correctly in files index

3f868d2

juliaElastic mentioned this pull request Jan 30, 2023

[Fleet] changed diagnostics query to speed up elastic/kibana#149812

Merged

1 task

kpollich mentioned this pull request Jan 30, 2023

[Fleet]: Diagnostics not available under agent diagnostics tab on requesting zip. elastic/kibana#149808

Closed

commit the ack

3adfdaa

michel-laterman requested a review from blakerouse January 30, 2023 15:44

Diagnostics handler will use temp file

8d96c53

Diagnostics handler will write the bundle to a temp file on disk before reading chunks out to requests. This is done in an attempt to avoid loading a large file in memory. If the file creation on disk fails an in-memory buffer is used instead.

michel-laterman commented Jan 31, 2023

View reviewed changes

blakerouse requested changes Jan 31, 2023

View reviewed changes

internal/pkg/agent/application/actions/handlers/handler_action_diagnostics.go Show resolved Hide resolved

internal/pkg/agent/application/actions/handlers/handler_action_diagnostics.go Outdated Show resolved Hide resolved

michel-laterman added 2 commits January 31, 2023 08:40

Change to async handler, add panic recover

cc1cd33

fix linter

f97d6d7

blakerouse approved these changes Jan 31, 2023

View reviewed changes

internal/pkg/agent/application/actions/handlers/handler_action_diagnostics.go Outdated Show resolved Hide resolved

michel-laterman and others added 2 commits January 31, 2023 09:29

Update internal/pkg/agent/application/actions/handlers/handler_action…

cfa3e96

…_diagnostics.go Co-authored-by: Blake Rouse <[email protected]>

build error out of recovered item

09293ff

michel-laterman merged commit 9fd2055 into elastic:main Jan 31, 2023

michel-laterman deleted the diagnostics-action branch January 31, 2023 18:55

michel-laterman mentioned this pull request Jan 31, 2023

Add section to monitoring page about agent diagnostics elastic/observability-docs#2579

Merged

juliaElastic mentioned this pull request Feb 6, 2023

[Fleet] Restrict Request diagnostics action to agents with version >= 8.7 elastic/kibana#150300

Closed

cmacknz mentioned this pull request Feb 14, 2023

Add the file extension in unit diagnostics again. #2270

Merged

juliaElastic mentioned this pull request Mar 6, 2023

V2 Agent actions acks/results are delayed by 5 minutes #2410

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add agent diagnostics action #1703

Add agent diagnostics action #1703

michel-laterman commented Nov 9, 2022 •

edited

Loading

mergify bot commented Nov 9, 2022

michel-laterman commented Nov 9, 2022 •

edited

Loading

elasticmachine commented Nov 9, 2022 •

edited

Loading

Build stats

Test stats 🧪

elasticmachine commented Nov 16, 2022 •

edited

Loading

mergify bot commented Nov 28, 2022

michel-laterman commented Jan 30, 2023

juliaElastic commented Jan 30, 2023

michel-laterman Jan 31, 2023

kpollich Jan 31, 2023

cmacknz commented Jan 31, 2023

blakerouse left a comment

blakerouse left a comment

michel-laterman commented Jan 31, 2023

michel-laterman commented Jan 31, 2023

michel-laterman commented Jan 31, 2023

Add agent diagnostics action #1703

Add agent diagnostics action #1703

Conversation

michel-laterman commented Nov 9, 2022 • edited Loading

What does this PR do?

Why is it important?

Checklist

How to test this PR locally

Related issues

mergify bot commented Nov 9, 2022

michel-laterman commented Nov 9, 2022 • edited Loading

elasticmachine commented Nov 9, 2022 • edited Loading

💚 Build Succeeded

Build stats

Test stats 🧪

💚 Flaky test report

🤖 GitHub comments

elasticmachine commented Nov 16, 2022 • edited Loading

🌐 Coverage report

mergify bot commented Nov 28, 2022

michel-laterman commented Jan 30, 2023

juliaElastic commented Jan 30, 2023

michel-laterman Jan 31, 2023

Choose a reason for hiding this comment

kpollich Jan 31, 2023

Choose a reason for hiding this comment

cmacknz commented Jan 31, 2023

blakerouse left a comment

Choose a reason for hiding this comment

blakerouse left a comment

Choose a reason for hiding this comment

michel-laterman commented Jan 31, 2023

michel-laterman commented Jan 31, 2023

michel-laterman commented Jan 31, 2023

michel-laterman commented Nov 9, 2022 •

edited

Loading

michel-laterman commented Nov 9, 2022 •

edited

Loading

elasticmachine commented Nov 9, 2022 •

edited

Loading

elasticmachine commented Nov 16, 2022 •

edited

Loading