Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add agent diagnostics action #1703

Merged
merged 40 commits into from
Jan 31, 2023

Conversation

michel-laterman
Copy link
Contributor

@michel-laterman michel-laterman commented Nov 9, 2022

What does this PR do?

Add the ability for the agent to receive a diagnostics action. This action makes the agent take a diagnostics bundle and upload it to a new fleet-server file upload API. Changed internal details of how diagnostics bundles are taken to better support this:

  • Move the archive (zip) assembly from cmd/diagnostics.go into the diagnostics module so that running diagnostics as a command and as an action will produce the same binary.
  • log files are collected during zip assembly (unchanged, but moved)
  • diagnostics handler attempts to create the bundle in a temp file and falls back to in-memory buffer in case of an issue
  • Upload client has a built-in retry mechanism that will resend requests if a 429 is received.
  • Change how action ackevents are created (added a method to the action interface) in order to simplify the acker implementation

Why is it important?

In order to make it easier for customers to debug deployments we are adding the ability to gather diagnostics bundles in the fleet UI in Kibana.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Run ES snapshot and Kibana with changes present in elastic/kibana#149575

Run a SNAPSHOT build of fleet-server (that has the file upload API)

Enroll a new agent in fleet.

Go to the diagnostics tab and request a new diagnostics bundle.

Related issues

Add the ability for the agent to recieve a diagnostics action. This
action makes the agent take a diagnostics bundle and upload it to a new
fleet-server file upload API. Changed internal details of how
diagnostics bundles are taken to better support this:

Hooks are provided through a callback func so that log files generated
as the agent is running can be included in the bundle. Change the Hook
func signature from returning a []byte to a []byte, time.Time so that
the file mod times collected through diagnostics hooks can be returned
from the same filesystem interaction. Upload client has a built-in
retry mechanism that will resend requests if a 429 is recieved.
@michel-laterman michel-laterman added enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team Team:Fleet Label for the Fleet team labels Nov 9, 2022
@mergify
Copy link
Contributor

mergify bot commented Nov 9, 2022

This pull request does not have a backport label. Could you fix it @michel-laterman? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip label Nov 9, 2022
@michel-laterman
Copy link
Contributor Author

michel-laterman commented Nov 9, 2022

Recreate #1631 now that v2 has been merged.

TODO:

@elasticmachine
Copy link
Contributor

elasticmachine commented Nov 9, 2022

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-01-31T18:31:16.272+0000

  • Duration: 16 min 48 sec

Test stats 🧪

Test Results
Failed 0
Passed 4889
Skipped 13
Total 4902

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@michel-laterman michel-laterman force-pushed the diagnostics-action branch 5 times, most recently from 97bff62 to 394db46 Compare November 16, 2022 23:43
@elasticmachine
Copy link
Contributor

elasticmachine commented Nov 16, 2022

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 98.333% (59/60) 👍 0.028
Files 69.082% (143/207) 👎 -0.186
Classes 68.514% (272/397) 👎 -0.717
Methods 53.836% (835/1551) 👎 -0.211
Lines 39.188% (9097/23214) 👎 -0.047
Conditionals 100.0% (0/0) 💚

Use the control.Client interface instead of a reference to the
coordinator so that less boilerplate code is used. Fix callback setups.
Cleanup uploader structs as the API has simplified a little.
Change the construction of AckEvent from being an internal method in the
fleet.Acker to a method all Actions implement in order to reduce the
acker complexity and increase the clarity of what is sent when an action
is acked at the cost some code duplication.
@mergify
Copy link
Contributor

mergify bot commented Nov 28, 2022

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b diagnostics-action upstream/diagnostics-action
git merge upstream/main
git push upstream diagnostics-action

@michel-laterman michel-laterman marked this pull request as ready for review December 12, 2022 19:42
@michel-laterman michel-laterman requested a review from a team as a code owner December 12, 2022 19:42
@michel-laterman
Copy link
Contributor Author

@juliaElastic, added a call to commit the ack, that should remove the delay

@juliaElastic
Copy link
Contributor

added a call to commit the ack, that should remove the delay

@michel-laterman I tested with the change, but I don't see any change in the delay. Does this work for you locally?

juliaElastic added a commit to elastic/kibana that referenced this pull request Jan 30, 2023
## Summary

Changed query of diagnostics files to speed up seeing the files. This is
because the agent has a delay of about 4m to ack the action, this has to
be fixed separately, see here
elastic/elastic-agent#1703 (comment)

Related to #141074

We can search for the diagnostics file by `agent_id` and `action_id`, so
don't have to wait for the `upload_id` which comes from
`.fleet-actions-results`.


https://user-images.githubusercontent.com/90178898/215451881-bfaa9e86-e055-4490-87b1-dc1d1076a738.mov

Displaying error from agent when diagnostics failed:

<img width="839" alt="image"
src="https://user-images.githubusercontent.com/90178898/215476207-5db7e935-28dd-432e-a6a6-195da162028a.png">


E.g. `.fleet-files-agent`

```
{
        "_index": ".fleet-files-agent-000001",
        "_id": "8a004559-0731-4b8f-b29e-d7405ca0d68c.3a1f21b3-4559-4d3f-aae0-58356c269a92",
        "_score": null,
        "_source": {
          "action_id": "8a004559-0731-4b8f-b29e-d7405ca0d68c",
          "agent_id": "3a1f21b3-4559-4d3f-aae0-58356c269a92",
          "contents": null,
          "file": {
            "ChunkSize": 4194304,
            "Status": "READY",
            "ext": "zip",
            "hash": {
              "md5": "",
              "sha256": ""
            },
            "mime_type": "application/zip",
            "name": "elastic-agent-diagnostics-2023-01-30T10-13-33Z-00.zip",
            "size": 577178
          },
          "src": "agent",
          "upload_id": "988da8ad-9d92-4d18-b5b0-b2a7e77f5a81",
          "upload_start": 1675073615066,
          "transithash": {
            "sha256": "8a417cc8a73e32723ff449b603412113f319c7447044e81acab3f57d4e8226c8"
          }
        },
```

Changed the style to be more consistent:

<img width="898" alt="image"
src="https://user-images.githubusercontent.com/90178898/215492173-7362fab7-15e6-4de9-824b-239164512231.png">



### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
Diagnostics handler will write the bundle to a temp file on disk before
reading chunks out to requests. This is done in an attempt to avoid
loading a large file in memory. If the file creation on disk fails an
in-memory buffer is used instead.
Comment on lines +85 to +88
// attempt to create the a temporary diagnostics file on disk in order to avoid loading a
// potentially large file in memory.
// if on-disk creation fails an in-memory buffer is used.
f, s, err := h.diagFile(aDiag, uDiag)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using Amit's suggestion here to use an on-disk temp file and fall a back to an in-memory buffer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work to turn around on that feedback so quickly. Really appreciate it.

@cmacknz
Copy link
Member

cmacknz commented Jan 31, 2023

OK, figured out why it takes a while for the diagnostics to show; tldr the ack is delayed (I believe due to batching)
Here's the fleet-server logs

Probably this is the same problem as #2410

Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good. I do think changing this to async should be done before merging.

Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change looks good. Got a typo that needs to be fixed.

@michel-laterman
Copy link
Contributor Author

/test

1 similar comment
@michel-laterman
Copy link
Contributor Author

/test

@michel-laterman michel-laterman merged commit 9fd2055 into elastic:main Jan 31, 2023
@michel-laterman michel-laterman deleted the diagnostics-action branch January 31, 2023 18:55
@michel-laterman
Copy link
Contributor Author

I'll follow up with a docs pr

kqualters-elastic pushed a commit to kqualters-elastic/kibana that referenced this pull request Feb 6, 2023
… to use upload_id (elastic#149575)

## Summary

Closes elastic#141074

Enabled feature flag and tweaked implementation to find file by
`upload_id` rather than doc id.

How to test:
- Start local kibana, start Fleet Server, enroll Elastic Agent from
local (pull [these
changes](elastic/elastic-agent#1703) )
- Click on Request Diagnostics action on the Agent
- The diagnostics file should appear on Agent Details / Diagnostics tab.
- The action should be completed on Agent activity

<img width="1585" alt="image"
src="https://user-images.githubusercontent.com/90178898/214805187-2b1abe34-ba7e-4612-9fad-7ef1f5942f47.png">
<img width="745" alt="image"
src="https://user-images.githubusercontent.com/90178898/214805997-20fdaa01-e4c5-461c-b395-1b1e43117f8a.png">

The file metadata and binary can be queried from these indices:

```
GET .fleet-files-agent/_search

GET .fleet-file-data-agent/_search
```

Tweaked the implementation so that the pending actions are showing up as
soon as the `.fleet-actions` record is created (it can take several
minutes until the action result is ready)
Plus added a tooltip for error status

<img width="948" alt="image"
src="https://user-images.githubusercontent.com/90178898/214841337-eacbb1fc-4934-4d8b-9d52-8db4502d2493.png">



### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
kqualters-elastic pushed a commit to kqualters-elastic/kibana that referenced this pull request Feb 6, 2023
## Summary

Changed query of diagnostics files to speed up seeing the files. This is
because the agent has a delay of about 4m to ack the action, this has to
be fixed separately, see here
elastic/elastic-agent#1703 (comment)

Related to elastic#141074

We can search for the diagnostics file by `agent_id` and `action_id`, so
don't have to wait for the `upload_id` which comes from
`.fleet-actions-results`.


https://user-images.githubusercontent.com/90178898/215451881-bfaa9e86-e055-4490-87b1-dc1d1076a738.mov

Displaying error from agent when diagnostics failed:

<img width="839" alt="image"
src="https://user-images.githubusercontent.com/90178898/215476207-5db7e935-28dd-432e-a6a6-195da162028a.png">


E.g. `.fleet-files-agent`

```
{
        "_index": ".fleet-files-agent-000001",
        "_id": "8a004559-0731-4b8f-b29e-d7405ca0d68c.3a1f21b3-4559-4d3f-aae0-58356c269a92",
        "_score": null,
        "_source": {
          "action_id": "8a004559-0731-4b8f-b29e-d7405ca0d68c",
          "agent_id": "3a1f21b3-4559-4d3f-aae0-58356c269a92",
          "contents": null,
          "file": {
            "ChunkSize": 4194304,
            "Status": "READY",
            "ext": "zip",
            "hash": {
              "md5": "",
              "sha256": ""
            },
            "mime_type": "application/zip",
            "name": "elastic-agent-diagnostics-2023-01-30T10-13-33Z-00.zip",
            "size": 577178
          },
          "src": "agent",
          "upload_id": "988da8ad-9d92-4d18-b5b0-b2a7e77f5a81",
          "upload_start": 1675073615066,
          "transithash": {
            "sha256": "8a417cc8a73e32723ff449b603412113f319c7447044e81acab3f57d4e8226c8"
          }
        },
```

Changed the style to be more consistent:

<img width="898" alt="image"
src="https://user-images.githubusercontent.com/90178898/215492173-7362fab7-15e6-4de9-824b-239164512231.png">



### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team Team:Fleet Label for the Fleet team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Fleet] Support new "request diagnostics" action type
10 participants