[Fleet Server] Fleet Server Observability #812

Open
8 tasks
mostlyjason opened this issue Mar 26, 2021 · 14 comments
Labels
  • enhancement: New feature or request
  • estimation:Week: Task that represents a week of work.
  • Integration:fleet_server: Fleet Server
  • Team:Elastic-Agent-Control-Plane: Label for the Agent Control Plane team
  • Team:Fleet: Label for the Fleet team [elastic/fleet]

Comments

@mostlyjason
Contributor

mostlyjason commented Mar 26, 2021

Let's add a dashboard for Fleet Server operators to help them scale the server when needed and troubleshoot issues.

It should show:

  • System metrics such as CPU and memory usage by host over time. Should be accurate for VMs and containers. Helps to identify infrastructure limits. This should work for agents with system monitoring enabled and for hosted agents with only internal monitoring.
  • Fleet Server process metrics such as CPU and memory usage by host over time. Should be accurate for VMs and containers. Shows how much capacity Fleet Server uses compared to other processes.
  • Status by host over time. Provides a history of when Fleet Server was offline, updating, unhealthy, etc.
  • Log stream component showing errors. Useful for troubleshooting.
  • A note with a link to the stack monitoring app, where users can monitor APM Server and standalone Filebeat/Metricbeat (FB/MB), which run in the same container.
  • Filter on hostname. Lets operators isolate metrics from particular Fleet Server hosts.

Stretch:

  • Number of active connections by host over time. Lets operators see resource usage as a function of capacity.
  • Number of rejected connections by host over time. Lets operators see when limits are reached and the impact on clients.
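
As a rough illustration of the two stretch metrics, here is a minimal Python sketch that rolls raw connection samples up by host and time bucket. Everything in it is an assumption made for illustration: the sample records, the field names (`ts`, `host`, `active`, `rejected`), and the choice to treat `rejected` as a per-sample delta are hypothetical and do not reflect Fleet Server's actual metrics schema.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical samples; field names are illustrative, not a real schema.
# "rejected" is assumed to be the number of rejections since the previous sample.
SAMPLES = [
    {"ts": "2021-03-26T10:00:05Z", "host": "fleet-a", "active": 120, "rejected": 0},
    {"ts": "2021-03-26T10:00:35Z", "host": "fleet-a", "active": 150, "rejected": 3},
    {"ts": "2021-03-26T10:01:10Z", "host": "fleet-a", "active": 90, "rejected": 1},
    {"ts": "2021-03-26T10:00:20Z", "host": "fleet-b", "active": 40, "rejected": 0},
]

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def bucket_start(ts: str, width_s: int = 60) -> str:
    """Floor a UTC timestamp to the start of its bucket of width_s seconds."""
    t = datetime.strptime(ts, TS_FORMAT).replace(tzinfo=timezone.utc)
    epoch = int(t.timestamp())
    floored = epoch - epoch % width_s
    return datetime.fromtimestamp(floored, tz=timezone.utc).strftime(TS_FORMAT)


def connections_by_host(samples, width_s: int = 60):
    """Return {(host, bucket): {"max_active": int, "rejected": int}}.

    Active connections are summarized by their peak within the bucket;
    rejected connections are summed, since they are assumed to be deltas.
    """
    out = defaultdict(lambda: {"max_active": 0, "rejected": 0})
    for s in samples:
        key = (s["host"], bucket_start(s["ts"], width_s))
        out[key]["max_active"] = max(out[key]["max_active"], s["active"])
        out[key]["rejected"] += s["rejected"]
    return dict(out)


if __name__ == "__main__":
    for (host, bucket), stats in sorted(connections_by_host(SAMPLES).items()):
        print(host, bucket, stats)
```

Plotting `max_active` against a host's configured connection limit would show how close it runs to capacity, and a non-zero `rejected` bucket would mark when clients started being turned away.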

Related issues:

Open questions:

  1. Should we have a separate dashboard for Fleet Server or combine it with the Elastic Agent dashboard?
    • They should have separate dashboards. They have different metrics to visualize; for example, only Fleet Server has a connection count. System metrics are particularly useful for Fleet Server because the goal is to maximize utilization and observe when it's necessary to scale the infrastructure. The Elastic Agent running on an endpoint should have low utilization, so it will be easier to visualize these use cases separately. Also, it's a standard pattern to include dashboards for each integration, so it will be more discoverable as part of the Fleet Server integration.
  2. Confirm system metrics are enabled on cloud
  3. How can we filter the Fleet Server hosts from the other hosts in the dashboards?
@mostlyjason mostlyjason added the Team:Elastic-Agent Label for the Agent team label Mar 26, 2021
@elasticmachine

Pinging @elastic/agent (Team:Agent)

@simitt
Contributor

simitt commented Mar 29, 2021

For which version is this planned?

@ph
Contributor

ph commented Mar 29, 2021

@simitt We haven't added it to the roadmap yet.

@mostlyjason
Contributor Author

@ravikesarwani Can I get your input on the Fleet Server dashboard definition? In particular, the open question about how it relates to the Elastic Agent dashboard?

@ravikesarwani

@mostlyjason Yes, I would love to help and add my thoughts, but I don't have a lot of context on Fleet Server. Can we set up some time next week to discuss, so that you can fill me in and I can then research and add my thoughts?

@michalpristas
Contributor

Added a 7.14 label; please update if needed for another version.

@EricDavisX
Contributor

@mostlyjason Just pinging: I'm doing follow-ups on open issues for 7.14. Is this still under review or a possible merge for 7.14? Assuming not, I'll remove the 7.14 label; please add it back if any work was done for this cycle. If this is urgent for the Fleet / Agent GA, we may yet have time to finish it out.

@elasticmachine

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@nimarezainia nimarezainia removed the Team:Elastic-Agent Label for the Agent team label Jan 20, 2022
@nimarezainia
Contributor

@jlind23 I added this to the data plane board because, for some reason, GH is not finding the control plane board in this repo. Do you have the same issue?

@nimarezainia nimarezainia added the Team:Fleet Label for the Fleet team [elastic/fleet] label Jan 20, 2022
@elasticmachine

Pinging @elastic/fleet (Team:Fleet)

@nimarezainia nimarezainia added the enhancement New feature or request label Jan 20, 2022
@jlind23
Contributor

jlind23 commented Jan 24, 2022

@nimarezainia Nope, I don't have this issue. I just added it to the control plane board.

@nimarezainia nimarezainia changed the title [Fleet Server] Add dashboard for Fleet Server [Fleet Server] Fleet Server Observability Feb 7, 2022
@michel-laterman michel-laterman removed their assignment Feb 24, 2022
@jlind23
Contributor

jlind23 commented Mar 16, 2022

This one is about exporting the newly created dashboard and pushing it to the fleet-server package.

@jlind23
Contributor

jlind23 commented May 5, 2022

@nimarezainia Do you have a full list of the needed indicators? Could you please work on it before the 8.4 kickoff?

@jlind23 jlind23 added v8.4.0 estimation:Week Task that represents a week of work. and removed 8.4-candidate labels May 24, 2022
@jlind23
Contributor

jlind23 commented May 24, 2022

@nimarezainia could you jump in here please?
