-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add agent diagnostics action #1703
Add agent diagnostics action #1703
Conversation
Add the ability for the agent to recieve a diagnostics action. This action makes the agent take a diagnostics bundle and upload it to a new fleet-server file upload API. Changed internal details of how diagnostics bundles are taken to better support this: Hooks are provided through a callback func so that log files generated as the agent is running can be included in the bundle. Change the Hook func signature from returning a []byte to a []byte, time.Time so that the file mod times collected through diagnostics hooks can be returned from the same filesystem interaction. Upload client has a built-in retry mechanism that will resend requests if a 429 is recieved.
This pull request does not have a backport label. Could you fix it @michel-laterman? 🙏
NOTE: |
Recreate #1631 now that v2 has been merged.
|
97bff62
to
394db46
Compare
🌐 Coverage report
|
394db46
to
e3ecfab
Compare
Use the control.Client interface instead of a reference to the coordinator so that less boilerplate code is used. Fix callback setups. Cleanup uploader structs as the API has simplified a little.
e3ecfab
to
e9b59a2
Compare
Change the construction of AckEvent from being an internal method in the fleet.Acker to a method all Actions implement in order to reduce the acker complexity and increase the clarity of what is sent when an action is acked at the cost some code duplication.
This pull request is now in conflicts. Could you fix it? 🙏
|
@juliaElastic, added a call to commit the ack, that should remove the delay |
@michel-laterman I tested with the change, but I don't see any change in the delay. Does this work for you locally? |
## Summary Changed query of diagnostics files to speed up seeing the files. This is because the agent has a delay of about 4m to ack the action, this has to be fixed separately, see here elastic/elastic-agent#1703 (comment) Related to #141074 We can search for the diagnostics file by `agent_id` and `action_id`, so don't have to wait for the `upload_id` which comes from `.fleet-actions-results`. https://user-images.githubusercontent.com/90178898/215451881-bfaa9e86-e055-4490-87b1-dc1d1076a738.mov Displaying error from agent when diagnostics failed: <img width="839" alt="image" src="https://user-images.githubusercontent.com/90178898/215476207-5db7e935-28dd-432e-a6a6-195da162028a.png"> E.g. `.fleet-files-agent` ``` { "_index": ".fleet-files-agent-000001", "_id": "8a004559-0731-4b8f-b29e-d7405ca0d68c.3a1f21b3-4559-4d3f-aae0-58356c269a92", "_score": null, "_source": { "action_id": "8a004559-0731-4b8f-b29e-d7405ca0d68c", "agent_id": "3a1f21b3-4559-4d3f-aae0-58356c269a92", "contents": null, "file": { "ChunkSize": 4194304, "Status": "READY", "ext": "zip", "hash": { "md5": "", "sha256": "" }, "mime_type": "application/zip", "name": "elastic-agent-diagnostics-2023-01-30T10-13-33Z-00.zip", "size": 577178 }, "src": "agent", "upload_id": "988da8ad-9d92-4d18-b5b0-b2a7e77f5a81", "upload_start": 1675073615066, "transithash": { "sha256": "8a417cc8a73e32723ff449b603412113f319c7447044e81acab3f57d4e8226c8" } }, ``` Changed the style to be more consistent: <img width="898" alt="image" src="https://user-images.githubusercontent.com/90178898/215492173-7362fab7-15e6-4de9-824b-239164512231.png"> ### Checklist - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
Diagnostics handler will write the bundle to a temp file on disk before reading chunks out to requests. This is done in an attempt to avoid loading a large file in memory. If the file creation on disk fails an in-memory buffer is used instead.
// attempt to create the a temporary diagnostics file on disk in order to avoid loading a | ||
// potentially large file in memory. | ||
// if on-disk creation fails an in-memory buffer is used. | ||
f, s, err := h.diagFile(aDiag, uDiag) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using Amit's suggestion here to use an on-disk temp file and fall a back to an in-memory buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work to turn around on that feedback so quickly. Really appreciate it.
Probably this is the same problem as #2410 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good. I do think changing this to async should be done before merging.
internal/pkg/agent/application/actions/handlers/handler_action_diagnostics.go
Show resolved
Hide resolved
internal/pkg/agent/application/actions/handlers/handler_action_diagnostics.go
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change looks good. Got a typo that needs to be fixed.
internal/pkg/agent/application/actions/handlers/handler_action_diagnostics.go
Outdated
Show resolved
Hide resolved
…_diagnostics.go Co-authored-by: Blake Rouse <[email protected]>
/test |
1 similar comment
/test |
I'll follow up with a docs pr |
… to use upload_id (elastic#149575) ## Summary Closes elastic#141074 Enabled feature flag and tweaked implementation to find file by `upload_id` rather than doc id. How to test: - Start local kibana, start Fleet Server, enroll Elastic Agent from local (pull [these changes](elastic/elastic-agent#1703) ) - Click on Request Diagnostics action on the Agent - The diagnostics file should appear on Agent Details / Diagnostics tab. - The action should be completed on Agent activity <img width="1585" alt="image" src="https://user-images.githubusercontent.com/90178898/214805187-2b1abe34-ba7e-4612-9fad-7ef1f5942f47.png"> <img width="745" alt="image" src="https://user-images.githubusercontent.com/90178898/214805997-20fdaa01-e4c5-461c-b395-1b1e43117f8a.png"> The file metadata and binary can be queried from these indices: ``` GET .fleet-files-agent/_search GET .fleet-file-data-agent/_search ``` Tweaked the implementation so that the pending actions are showing up as soon as the `.fleet-actions` record is created (it can take several minutes until the action result is ready) Plus added a tooltip for error status <img width="948" alt="image" src="https://user-images.githubusercontent.com/90178898/214841337-eacbb1fc-4934-4d8b-9d52-8db4502d2493.png"> ### Checklist - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
## Summary Changed query of diagnostics files to speed up seeing the files. This is because the agent has a delay of about 4m to ack the action, this has to be fixed separately, see here elastic/elastic-agent#1703 (comment) Related to elastic#141074 We can search for the diagnostics file by `agent_id` and `action_id`, so don't have to wait for the `upload_id` which comes from `.fleet-actions-results`. https://user-images.githubusercontent.com/90178898/215451881-bfaa9e86-e055-4490-87b1-dc1d1076a738.mov Displaying error from agent when diagnostics failed: <img width="839" alt="image" src="https://user-images.githubusercontent.com/90178898/215476207-5db7e935-28dd-432e-a6a6-195da162028a.png"> E.g. `.fleet-files-agent` ``` { "_index": ".fleet-files-agent-000001", "_id": "8a004559-0731-4b8f-b29e-d7405ca0d68c.3a1f21b3-4559-4d3f-aae0-58356c269a92", "_score": null, "_source": { "action_id": "8a004559-0731-4b8f-b29e-d7405ca0d68c", "agent_id": "3a1f21b3-4559-4d3f-aae0-58356c269a92", "contents": null, "file": { "ChunkSize": 4194304, "Status": "READY", "ext": "zip", "hash": { "md5": "", "sha256": "" }, "mime_type": "application/zip", "name": "elastic-agent-diagnostics-2023-01-30T10-13-33Z-00.zip", "size": 577178 }, "src": "agent", "upload_id": "988da8ad-9d92-4d18-b5b0-b2a7e77f5a81", "upload_start": 1675073615066, "transithash": { "sha256": "8a417cc8a73e32723ff449b603412113f319c7447044e81acab3f57d4e8226c8" } }, ``` Changed the style to be more consistent: <img width="898" alt="image" src="https://user-images.githubusercontent.com/90178898/215492173-7362fab7-15e6-4de9-824b-239164512231.png"> ### Checklist - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
What does this PR do?
Add the ability for the agent to receive a diagnostics action. This action makes the agent take a diagnostics bundle and upload it to a new fleet-server file upload API. Changed internal details of how diagnostics bundles are taken to better support this:
Why is it important?
In order to make it easier for customers to debug deployments we are adding the ability to gather diagnostics bundles in the fleet UI in Kibana.
Checklist
CHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.How to test this PR locally
Run ES snapshot and Kibana with changes present in elastic/kibana#149575
Run a SNAPSHOT build of
fleet-server
(that has the file upload API)Enroll a new agent in fleet.
Go to the diagnostics tab and request a new diagnostics bundle.
Related issues