[Fleet] Add workflow for requesting and downloading agent diagnostics from Fleet UI #141074

Closed
19 tasks done
kpollich opened this issue Sep 20, 2022 · 25 comments · Fixed by #142369 or #149575
Labels
QA:Validated Issue has been validated by QA Team:Fleet Team label for Observability Data Collection Fleet team

Comments

@kpollich
Member

kpollich commented Sep 20, 2022

Blocked by https://github.com/elastic/security-team/issues/4661

Background

A common supportability concern with Fleet/Agent is the collection and investigation of diagnostics. Elastic Agent exposes the elastic-agent diagnostics collect command, which outputs a .zip containing various diagnostics information that's crucial for debugging purposes.

We'd like to expose these diagnostics files in Fleet UI to improve debug-ability and reduce support overhead when requesting these diagnostics.

There are a few components at play here:

  • Fleet Server's "file upload" API that will allow agents to upload their diagnostics files (or any other arbitrary files) to Kibana. Tracked in https://github.com/elastic/security-team/issues/4661 and slotted for delivery during 8.6.
  • A new UPLOAD_DIAGNOSTICS (name not final) action type for initiating the collection -> upload of Agent diagnostics
  • A Fleet UI workflow for requesting diagnostics, being notified of upload completion, and viewing previously requested diagnostics (this issue)

Implementation

  • API requirements
    • Support creating the REQUEST_DIAGNOSTICS action type in the existing POST /api/fleet/agents/<id>/action API
    • Add a new API for listing the files an agent has uploaded: GET /api/fleet/agents/<id>/uploads
    • Use the Kibana File Storage HTTP API for downloading: GET /api/files/files/<id>/blob[/<filename>]
  • Agent Details Page - /agents/:id
    • Add a new action to the "actions" dropdown: "Request diagnostics .zip"
    • Clicking "Request diagnostics" creates a new REQUEST_DIAGNOSTICS action and navigates the user to a new "Diagnostics" tab on the agent details page
    • Diagnostics tab
      • A list of diagnostics files for the current agent is displayed as a table with two columns: File and Date
        • Each file name is prefixed with a "download" icon, and clicking the file name downloads the file using the browser's standard file download process
        • Any pending uploads (e.g. the action is created but not yet complete) are displayed with a filename of "Generating diagnostics" and a date corresponding to the action's creation
        • Any errored uploads (e.g. the action reports an error status) are displayed with an error icon and a derived timestamp filename based on the action's creation date
      • A "Request diagnostics .zip" button exists to initiate (create action) a new diagnostics upload. The table dynamically updates to include this new request.
  • Agent Listing Page - /agents
    • Add a new "Request diagnostics .zip" action to each agent's action menu that creates the action record -> navigates the user to the Diagnostics tab for the selected agent as above
    • Requesting diagnostics for multiple agents at once should be possible through the bulk actions menu when multiple agents are selected
  • Global functionality
    • When a diagnostics upload is completed, the user should be notified via a toast message. This can probably be implemented by polling for the action's status for a set amount of time after diagnostics are requested.
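
The Diagnostics tab rules above can be sketched as a small mapping from action state to table rows. This is only an illustration: the `DiagnosticsAction` shape, the status values, and the `toRow` and `derivedFileName` helpers are assumptions made for this sketch, not Fleet's actual types or naming scheme.

```typescript
// Hypothetical shapes for illustration only; not Fleet's real types.
type UploadStatus = 'READY' | 'IN_PROGRESS' | 'FAILED';

interface DiagnosticsAction {
  actionId: string;
  createdAt: string; // ISO timestamp of the action's creation
  status: UploadStatus;
  fileName?: string; // only known once the upload has completed
}

interface DiagnosticsRow {
  icon: 'download' | 'loading' | 'error';
  label: string;
  date: string;
  downloadable: boolean;
}

// Derive a file-like name from the creation date (assumed naming scheme).
function derivedFileName(createdAt: string): string {
  const stamp = new Date(createdAt).toISOString().replace(/[:.]/g, '-');
  return `elastic-agent-diagnostics-${stamp}.zip`;
}

// Map one action to the row shown in the File / Date table.
function toRow(action: DiagnosticsAction): DiagnosticsRow {
  switch (action.status) {
    case 'READY':
      return {
        icon: 'download',
        label: action.fileName ?? derivedFileName(action.createdAt),
        date: action.createdAt,
        downloadable: true,
      };
    case 'IN_PROGRESS':
      return {
        icon: 'loading',
        label: 'Generating diagnostics',
        date: action.createdAt,
        downloadable: false,
      };
    case 'FAILED':
      return {
        icon: 'error',
        label: derivedFileName(action.createdAt),
        date: action.createdAt,
        downloadable: false,
      };
  }
}
```

Keeping the mapping as a pure function would make the pending and errored cases easy to unit test independently of the table component.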

Demo

  • Prepare recorded demo

Designs

Overview: [image]

Agent details screen: [image]

Diagnostics tab: [two images]

Agent listing page: [image]

@kpollich kpollich added the Team:Fleet Team label for Observability Data Collection Fleet team label Sep 20, 2022
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@kpollich
Member Author

Requesting diagnostics for multiple agents at once should be possible through the bulk actions menu when multiple agents are selected

I've added this requirement based on an internal conversation w/ support where this was identified as a big win. @juliaElastic or @michel-laterman do you foresee any issues with "queuing up" diagnostics requests for multiple agents via .fleet-actions? The way I imagine this working is that we'll create an action record for each agent ID with type REQUEST_DIAGNOSTICS and we should process them eventually. Any concerns with that approach?

@michel-laterman

I don't see an issue with this.
But just to clarify, an action document has the agents: [] attribute, so we can give the same action to multiple agents.
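
To make the point about `agents: []` concrete, here is a minimal sketch of building a single action document that targets several agents. The `FleetActionDoc` interface and the `buildBulkDiagnosticsAction` helper are hypothetical; only the `agents` array and the `REQUEST_DIAGNOSTICS` type come from this thread, and the remaining field names are assumptions.

```typescript
// Hypothetical action document shape; the real .fleet-actions mapping may differ.
interface FleetActionDoc {
  action_id: string;
  type: 'REQUEST_DIAGNOSTICS';
  agents: string[];
  '@timestamp': string;
}

// Build one action document addressed to every selected agent.
function buildBulkDiagnosticsAction(
  actionId: string,
  agentIds: string[],
  now: Date = new Date()
): FleetActionDoc {
  // De-duplicate so an agent selected twice receives the action only once.
  const agents = agentIds.filter((id, index) => agentIds.indexOf(id) === index);
  if (agents.length === 0) {
    throw new Error('At least one agent ID is required');
  }
  return {
    action_id: actionId,
    type: 'REQUEST_DIAGNOSTICS',
    agents,
    '@timestamp': now.toISOString(),
  };
}
```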

@kpollich
Member Author

an action document has the agents: [] attribute, so we can give the same action to multiple agents.

Even better. Thanks for clarifying! This should make the process of requesting diagnostics from many agents simpler.

@juliaElastic
Contributor

Yes, this can work the same way as other bulk actions.

@juliaElastic
Contributor

juliaElastic commented Sep 30, 2022

@kpollich I think the bulk action of Request Diagnostics is more complex than we assumed:

For bulk selection, it doesn't sound logical to navigate to the Agent details page of one agent.
Alternatively, there could be a new screen listing the diagnostics of bulk actions, or the file could be shown on the details page of each agent included in the bulk action. In the latter case, it wouldn't be easy to find all the files generated as part of one bulk action.
Related question: is the backend implementation planned to produce one zip per agent for a bulk action?

There is also a question about the size of the files the bulk action would produce: are there any concerns about uploading hundreds of MB per agent? This could quickly add up to a very large total size if actioned on large agent selections.

@kpollich
Member Author

For bulk selection, it doesn't sound logical to navigate to the Agent details page of one agent.

I agree with this. We'll probably want to do something else after a bulk "request diagnostics" action is created.

Alternatively, there could be a new screen listing the diagnostics of bulk actions, or the file could be shown on the details page of each agent included in the bulk action. In the latter case, it wouldn't be easy to find all the files generated as part of one bulk action.

Could we open the "agent activity" flyout after this bulk action is created to show the pending action w/ some detail about its status? Maybe we can display an expandable section where each agent is listed w/ a link to its /diagnostics tab?

Related question: is the backend implementation planned to produce one zip per agent for a bulk action?

Yes there'd be one zip per agent.

There is also a question about the size of the files the bulk action would produce: are there any concerns about uploading hundreds of MB per agent? This could quickly add up to a very large total size if actioned on large agent selections.

This is true, and users should be aware that diagnostics incur storage costs + storage costs incur monetary costs on cloud. We may want to document this in a callout on the diagnostics tab. Something like Diagnostics files are stored in Elasticsearch, and as such can incur storage costs. Fleet will automatically remove old diagnostics files after 30 days.

@juliaElastic
Contributor

juliaElastic commented Sep 30, 2022

Could we open the "agent activity" flyout after this bulk action is created to show the pending action w/ some detail about its status? Maybe we can display an expandable section where each agent is listed w/ a link to its /diagnostics tab?

Good idea, though showing a link for each agent would only work for a limited number of agents.
When we implement the View agents functionality of the flyout, we would have a way to navigate to agent details from the actioned agent list as well.

This is true, and users should be aware that diagnostics incur storage costs + storage costs incur monetary costs on cloud. We may want to document this in a callout on the diagnostics tab. Something like Diagnostics files are stored in Elasticsearch, and as such can incur storage costs. Fleet will automatically remove old diagnostics files after 30 days.

Yes, we can do that. Even a confirmation window could be added with a warning message.

@kpollich
Member Author

Good idea, though showing a link for each agent would only work for a limited number of agents.

When we implement the #141206 functionality of the flyout, we would have a way to navigate to agent details from the actioned agent list as well.

Good points here. For bulk actions, let's just opt not to open the activity flyout once the action is created then. Eventually, we'll enhance the flyout with some more granular info about each agent for which an action was created. For now though, I think just showing the status of the "request bulk diagnostics" operation is good enough. The user will likely dismiss the flyout and visit the agents individually to access diagnostics afterwards.

@kpollich
Member Author

I'm realizing we don't have expiry captured anywhere here. We'll want to create a new index for the uploaded files to be stored in, and create an ILM policy during Fleet setup to manage it.

One thing I'm not clear on is whether we need to add to https://github.com/elastic/elasticsearch/tree/main/x-pack/plugin/core/src/main/resources (see .fleet-actions example) for this index, or if the file upload functionality will expose some kind of API for creating these "upload destination" indices. @pzl maybe you could answer this or point us in the right direction based on your work over on https://github.com/elastic/security-team/issues/4661?

@juliaElastic
Contributor

juliaElastic commented Oct 3, 2022

@kpollich We discussed on the call today that there will be different indices needed for fleet and endpoint security (and potentially other integrations in the future).
I think it makes sense to create and manage the fleet index in fleet setup code, including the ILM policy.
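
As a rough sketch of what the expiry part could look like, the helper below builds an ILM policy body with a delete phase matching the 30-day retention mentioned earlier in the thread. The function name and the idea of passing the retention as a parameter are assumptions for illustration; how Fleet setup would actually register the policy is not shown here.

```typescript
// Body shape for an Elasticsearch ILM policy with only a delete phase.
interface IlmPolicyBody {
  policy: {
    phases: {
      delete: {
        min_age: string;
        actions: { delete: Record<string, never> };
      };
    };
  };
}

// Hypothetical helper: build a policy that deletes indices after N days.
// Without a rollover action, min_age is measured from index creation.
function buildFileRetentionPolicy(retentionDays: number): IlmPolicyBody {
  if (Math.floor(retentionDays) !== retentionDays || retentionDays <= 0) {
    throw new Error('retentionDays must be a positive integer');
  }
  return {
    policy: {
      phases: {
        delete: {
          min_age: `${retentionDays}d`,
          actions: { delete: {} },
        },
      },
    },
  };
}
```

Calling `buildFileRetentionPolicy(30)` yields a body whose delete phase has `min_age` set to `30d`.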

@juliaElastic
Contributor

juliaElastic commented Oct 4, 2022

@kpollich @joshdover @paul-tavares
I used @pzl 's pr to have some mock files uploaded: elastic/fleet-server#1902
Currently this pr creates .fleet-files and .fleet-file_data indices to store the file metadata and blob.

I tested the Kibana File Service to query files from these indices, and got this error:
Elasticsearch blob store index name must start with ".kibana", got .fleet-file_data.

Is this a known limitation of Kibana File Service? Are we expected to use the .kibana prefix, which impacts the privileges?

When I put the .kibana prefix in front of the index name, I get an auth error when trying to upload a file:

juliabardi@Julias-MacBook-Pro ~ % go run ./upload.go elastic-agent-diagnostics-2022-10-04T09-54-34Z-00.zip
error API response. Status code 400:
{"statusCode":400,"error":"BadRequest","message":"elastic fail 403: security_exception: action [indices:data/write/bulk[s]] is unauthorized for service account [elastic/fleet-server] on indices [.kibana-fleet-files], this action is granted by the index privileges [create_doc,create,delete,index,write,all]"}
exit status 1

Here is the code that I used:

import { createEsFileClient } from '@kbn/files-plugin/server';

// esClient and appContextService are assumed to be in scope in the calling Fleet code.
const fileClient = createEsFileClient({
  blobStorageIndex: '.fleet-file_data',
  metadataIndex: '.fleet-files',
  elasticsearchClient: esClient,
  logger: appContextService.getLogger(),
});

const results = await fileClient.list();

EDIT: the kibana blocker is resolved now, managed to get the download working, see in pr description: #142369

@juliaElastic
Contributor

When a diagnostics upload is completed, the user should be notified via a toast message. This can probably be implemented by polling for the action's status for a set amount of time after diagnostics are requested.

@kpollich Do we mean to show the toast message when the user is on the Agent details / Diagnostics tab, or anywhere else in Fleet?
I added it to the Diagnostics tab component, because that component already polls for the diagnostics states, and the user is navigated to this tab after taking the action (except for the bulk action).

@kpollich
Member Author

Do we mean to show the toast message when the user is on the Agent details / Diagnostics tab, or anywhere else in Fleet?

I would expect the toast to function in the same "async global" way that package installation toasts work. I think this is implemented using the Kibana notifications service?

@juliaElastic
Contributor

I would expect the toast to function in the same "async global" way that package installation toasts work. I think this is implemented using the Kibana notifications service?

Yes, I am using the Kibana notifications service. My question was more about whether it is okay to poll only while the user is on the Diagnostics tab, or whether we also want polling in the background after they navigate away.

@kpollich
Member Author

Yes, I am using the Kibana notifications service. My question was more about whether it is okay to poll only while the user is on the Diagnostics tab, or whether we also want polling in the background after they navigate away.

Got it - thank you for clarifying. Let's just keep the polling on the diagnostics tab for now.
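
For reference, the "poll for a set amount of time, then notify" idea could look roughly like the sketch below. Everything here is an assumption for illustration: `fetchStatus` stands in for whatever call returns the action's status, `notify` stands in for a wrapper around the Kibana notifications service, and the status values, interval, and attempt limit are made up.

```typescript
// Hypothetical status values for a diagnostics request.
type ActionStatus = 'pending' | 'complete' | 'failed';

// Resolve after `ms` milliseconds.
function delay(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Poll until the action leaves 'pending' or the attempts run out.
// Both callbacks are injected so the loop stays independent of Kibana.
async function pollDiagnostics(
  fetchStatus: () => Promise<ActionStatus>,
  notify: (message: string) => void,
  intervalMs = 5000,
  maxAttempts = 60
): Promise<ActionStatus> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await fetchStatus();
    if (status === 'complete') {
      notify('Diagnostics ready');
      return status;
    }
    if (status === 'failed') {
      notify('Diagnostics request failed');
      return status;
    }
    await delay(intervalMs);
  }
  // Gave up waiting; the caller decides whether to keep showing a pending row.
  return 'pending';
}
```

Since polling stays on the Diagnostics tab, the component would also need to stop the loop when it unmounts; that cancellation is left out of this sketch.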

@kpollich
Member Author

Blocked by #143459

juliaElastic added a commit that referenced this issue Nov 10, 2022
## Summary

Closes #141074

### Request diagnostics action
Added new action for single agent (Agent details page and Agent list row
actions) to request diagnostics.
When clicking on the action, an API request is made that creates a
`REQUEST_DIAGNOSTICS` type action in `.fleet-actions` index.

### Diagnostics uploads display
When the action is submitted, the user is navigated to the new `Agent
Details / Diagnostics` tab, which shows the list of pending and
completed diagnostics file uploads. The information is coming from the
`/action_status` (for action status) as well as the `/uploads` endpoint
(for file name and path)
Clicking a diagnostics link downloads the file as a zip.

<img width="1060" alt="image"
src="https://user-images.githubusercontent.com/90178898/193816708-803c2a22-d421-4af2-9a78-785cdee81136.png">

Failed uploads display:
<img width="638" alt="image"
src="https://user-images.githubusercontent.com/90178898/194058366-d4874339-9fd1-419e-99e5-f592a6b3bf6d.png">
The expired status was not specified separately in the design; it will be
shown like the failed status (with warning icon).

### Mock data (blocker)
Currently returning mock data in the `/uploads` API, because of a
blocker in Kibana File Service, see
[here](#141074 (comment)).

### Bulk action
Added bulk action too:
<img width="1759" alt="image"
src="https://user-images.githubusercontent.com/90178898/194026861-bf0d5956-de2d-4d2b-895a-c35cf5252a5a.png">

Shows up in agent activity:
<img width="594" alt="image"
src="https://user-images.githubusercontent.com/90178898/194026960-356a5b40-1203-4182-ad7b-89b1432bf0f6.png">

The Fleet Server / Agent changes are not there yet, though FS delivers
the action, and Agents ack it (looks like default behavior for unknown
actions as well)

### Confirmation modal

Added a confirmation modal when clicking on action button everywhere,
except for the `Request diagnostics` button on the Diagnostics page.
Open question:
- Do we want to display the confirmation window on the Diagnostics page
button too?

<img width="673" alt="image"
src="https://user-images.githubusercontent.com/90178898/194065175-715b158e-0628-4bd9-86db-920c1ec9825e.png">

### Download

Generated file path to download in this format:
`/api/fleet/agents/files/{fileId}/{fileName}`

Decided not to try to use `files` plugin's API because it doesn't have
the Fleet authorization around it.
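
A minimal sketch of generating a path in the format above. The helper name is hypothetical, and URL-encoding both segments is an assumption of this sketch rather than something stated in this PR.

```typescript
// Hypothetical helper: build the Fleet download path for an uploaded file.
// Encoding both segments guards against spaces or slashes in file names.
function diagnosticsDownloadPath(fileId: string, fileName: string): string {
  return `/api/fleet/agents/files/${encodeURIComponent(fileId)}/${encodeURIComponent(fileName)}`;
}
```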

Screen recording demonstrating the download of an agent diagnostics zip
file, that I uploaded using the Fleet Server upload API (using [Dan's
pr](elastic/fleet-server#1902) locally)



https://user-images.githubusercontent.com/90178898/194287842-c7f09c9e-5310-460f-9cae-6fc7fa7750de.mov

### Notification

Added a toast message that shows up when a diagnostics file becomes ready
while we are on the Diagnostics tab.



https://user-images.githubusercontent.com/90178898/194318170-e7ec66db-8bf8-4535-b07e-682397c2920c.mov



### Checklist

Delete any items that are not applicable to this PR.

- [x] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

Co-authored-by: Kibana Machine <[email protected]>
@juliaElastic
Contributor

Reopening as there is a pending change to enable the feature flag.

@juliaElastic juliaElastic reopened this Nov 10, 2022
juliaElastic added a commit that referenced this issue Nov 17, 2022
## Summary

Follow up for #141074

Added Request Diagnostics to OpenAPI spec
@juliaElastic
Contributor

Moved this back to blocked, as we are waiting for the dependent changes to be merged before the feature flag can be enabled in 8.7.

@amitkanfer

Is this still blocked? I see #143459 is completed

@kpollich
Member Author

Is this still blocked? I see #143459 is completed

@amitkanfer - Yes this is still blocked by elastic/fleet-server#1902 and elastic/elastic-agent#1703

@mukeshelastic mukeshelastic changed the title [Fleet] Add workflow for requesting and downloading agent diagnostics [Fleet] Add workflow for requesting and downloading agent diagnostics from fleet UI Jan 24, 2023
juliaElastic added a commit that referenced this issue Jan 27, 2023
… to use upload_id (#149575)

## Summary

Closes #141074

Enabled feature flag and tweaked implementation to find file by
`upload_id` rather than doc id.

How to test:
- Start local kibana, start Fleet Server, enroll Elastic Agent from
local (pull [these
changes](elastic/elastic-agent#1703) )
- Click on Request Diagnostics action on the Agent
- The diagnostics file should appear on Agent Details / Diagnostics tab.
- The action should be completed on Agent activity

<img width="1585" alt="image"
src="https://user-images.githubusercontent.com/90178898/214805187-2b1abe34-ba7e-4612-9fad-7ef1f5942f47.png">
<img width="745" alt="image"
src="https://user-images.githubusercontent.com/90178898/214805997-20fdaa01-e4c5-461c-b395-1b1e43117f8a.png">

The file metadata and binary can be queried from these indices:

```
GET .fleet-files-agent/_search

GET .fleet-file-data-agent/_search
```

Tweaked the implementation so that the pending actions are showing up as
soon as the `.fleet-actions` record is created (it can take several
minutes until the action result is ready)
Plus added a tooltip for error status

<img width="948" alt="image"
src="https://user-images.githubusercontent.com/90178898/214841337-eacbb1fc-4934-4d8b-9d52-8db4502d2493.png">



### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
juliaElastic added a commit that referenced this issue Jan 30, 2023
## Summary

Changed the query for diagnostics files so the files show up sooner. The
agent takes about 4 minutes to ack the action; this has to be fixed
separately, see here
elastic/elastic-agent#1703 (comment)

Related to #141074

We can search for the diagnostics file by `agent_id` and `action_id`, so
we don't have to wait for the `upload_id`, which comes from
`.fleet-actions-results`.
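
A sketch of what such a lookup could look like as an Elasticsearch search body. The helper name is hypothetical, and using `term` filters assumes `agent_id` and `action_id` are mapped as exact-match (keyword) fields, which this PR does not confirm.

```typescript
// Search body with two exact-match filters.
interface FileSearchBody {
  size: number;
  query: {
    bool: {
      filter: Array<{ term: Record<string, string> }>;
    };
  };
}

// Hypothetical helper: find the file metadata doc for one agent and action.
function buildDiagnosticsFileQuery(agentId: string, actionId: string): FileSearchBody {
  return {
    size: 1,
    query: {
      bool: {
        filter: [
          { term: { agent_id: agentId } },
          { term: { action_id: actionId } },
        ],
      },
    },
  };
}
```

The body would be sent to the file metadata index, for example `.fleet-files-agent`.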


https://user-images.githubusercontent.com/90178898/215451881-bfaa9e86-e055-4490-87b1-dc1d1076a738.mov

Displaying error from agent when diagnostics failed:

<img width="839" alt="image"
src="https://user-images.githubusercontent.com/90178898/215476207-5db7e935-28dd-432e-a6a6-195da162028a.png">


E.g. `.fleet-files-agent`

```
{
        "_index": ".fleet-files-agent-000001",
        "_id": "8a004559-0731-4b8f-b29e-d7405ca0d68c.3a1f21b3-4559-4d3f-aae0-58356c269a92",
        "_score": null,
        "_source": {
          "action_id": "8a004559-0731-4b8f-b29e-d7405ca0d68c",
          "agent_id": "3a1f21b3-4559-4d3f-aae0-58356c269a92",
          "contents": null,
          "file": {
            "ChunkSize": 4194304,
            "Status": "READY",
            "ext": "zip",
            "hash": {
              "md5": "",
              "sha256": ""
            },
            "mime_type": "application/zip",
            "name": "elastic-agent-diagnostics-2023-01-30T10-13-33Z-00.zip",
            "size": 577178
          },
          "src": "agent",
          "upload_id": "988da8ad-9d92-4d18-b5b0-b2a7e77f5a81",
          "upload_start": 1675073615066,
          "transithash": {
            "sha256": "8a417cc8a73e32723ff449b603412113f319c7447044e81acab3f57d4e8226c8"
          }
        },
```

Changed the style to be more consistent:

<img width="898" alt="image"
src="https://user-images.githubusercontent.com/90178898/215492173-7362fab7-15e6-4de9-824b-239164512231.png">



### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
kqualters-elastic pushed a commit to kqualters-elastic/kibana that referenced this issue Feb 6, 2023
… to use upload_id (elastic#149575)

kqualters-elastic pushed a commit to kqualters-elastic/kibana that referenced this issue Feb 6, 2023
@amolnater-qasource

Hi Team,

We have executed 11 test cases under the feature test run for the 8.7.0 release at link:

Status:

  • PASS: 11

Build details:
VERSION: 8.7 BC6
BUILD: 61051
COMMIT: 04ef242

As the testing is completed on this feature, we are marking this as QA:Validated.

Please let us know if anything else is required from our end.
Thanks

@amolnater-qasource amolnater-qasource added QA:Validated Issue has been validated by QA and removed QA:Needs Validation Issue needs to be validated by QA labels Mar 21, 2023
@hop-dev
Contributor

hop-dev commented Mar 22, 2023

We have had to remove the ILM policies from the index templates in 8.7.0 due to an issue; more detail here: #153483

@kpollich
Member Author

We should write a docs issue and prepare some draft documentation for this feature. Support and our end users would greatly appreciate some references to this functionality in our troubleshooting docs.

@jen-huang
Contributor

@juliaElastic Could you file a docs issue for this? (cc @karenzone)

@kpollich kpollich changed the title [Fleet] Add workflow for requesting and downloading agent diagnostics from fleet UI [Fleet] Add workflow for requesting and downloading agent diagnostics from Fleet UI Apr 3, 2023
8 participants