
[Task Manager] Optimize status field output for health api #102400

Open
chrisronline opened this issue Jun 16, 2021 · 2 comments
Labels
discuss, estimate:needs-research, Feature:Task Manager, insight, response-ops-ec-backlog, Team:ResponseOps

Comments


chrisronline commented Jun 16, 2021

Relates to #101505

There are some points of confusion related to the task manager health API response that we should discuss and potentially fix.

  1. The runtime health status is always OK, even though we set the overall status to Error based on a runtime metric

  2. The workload health status is always OK, even though we set the overall status to Error based on a workload metric

  3. In [Task Manager] Log at different levels based on the state #101751, we log a warning when the p99 runtime drift is too high - maybe we should set the status to warning for the runtime bucket when this happens? (See the sketch after this list.)
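For reference, a minimal sketch of the shape being discussed. This is not the actual Kibana implementation; the type names, fields, thresholds, and the `runtimeStatus` helper are illustrative assumptions, only meant to show how a per-section status could be derived from the same metric that currently drives the overall status.

```ts
// Hypothetical, simplified shape of the task manager health API response.
// Field names and thresholds are illustrative, not the real Kibana types.
type HealthStatus = 'OK' | 'warn' | 'error';

interface MonitoredStat<T> {
  timestamp: string;    // when this section last updated itself
  status: HealthStatus; // points 1 and 2: today this stays OK even when the
                        // overall status is set to Error from its metrics
  value: T;
}

interface TaskManagerHealth {
  status: HealthStatus; // overall status
  stats: {
    runtime: MonitoredStat<{ drift: { p99: number } }>;
    workload: MonitoredStat<{ overdue: number }>;
  };
}

// One way the runtime bucket could reflect its own metric (and point 3):
// threshold values here are made-up numbers, in milliseconds.
function runtimeStatus(
  p99Drift: number,
  warnThreshold = 15_000,
  errorThreshold = 60_000
): HealthStatus {
  if (p99Drift >= errorThreshold) return 'error';
  if (p99Drift >= warnThreshold) return 'warn';
  return 'OK';
}
```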

@chrisronline added the discuss, Feature:Task Manager, and Team:ResponseOps labels on Jun 16, 2021
@elasticmachine

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@gmmorris added and then removed the Project:ObservabilityOfAlerting label on Jun 30, 2021
@gmmorris

The idea was that each section is updated independently, and when you request the overall health it looks at the constituent parts and derives the overall status, so that:

  1. If any part is in Error, then the whole thing is in Error
  2. If any part says it's OK but it's not fresh enough (the last update was OK but for some reason hasn't updated in 5 minutes), then the overall API returns an Error state
  3. It would be clear which part was in error and which is OK

We can obviously throw that idea out 🤷 but that's where the idea comes from.
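A rough sketch of that aggregation idea, for discussion only. The 5 minute freshness window comes from the comment above, but the type names, `FRESHNESS_WINDOW_MS`, and the `overallStatus` helper are assumptions, not the actual task manager code.

```ts
type HealthStatus = 'OK' | 'warn' | 'error';

interface Section {
  status: HealthStatus;
  timestamp: string; // last time this section updated itself
}

// "not fresh enough": the section hasn't updated in 5 minutes
const FRESHNESS_WINDOW_MS = 5 * 60 * 1000;

// Derive the overall status from the constituent parts:
// 1. any section in Error => overall Error
// 2. any section that is stale, whatever it reports => overall Error
// 3. otherwise the worst of the individual statuses
function overallStatus(
  sections: Record<string, Section>,
  now: number = Date.now()
): HealthStatus {
  let overall: HealthStatus = 'OK';
  for (const section of Object.values(sections)) {
    const stale = now - Date.parse(section.timestamp) > FRESHNESS_WINDOW_MS;
    if (section.status === 'error' || stale) return 'error';
    if (section.status === 'warn') overall = 'warn';
  }
  return overall;
}
```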

@mikecote added the loe:needs-research label on Jul 21, 2021
@gmmorris added the insight and estimate:needs-research labels on Aug 16, 2021
@gmmorris removed the loe:needs-research label on Sep 2, 2021
@kobelb added the needs-team label on Jan 31, 2022
@botelastic removed the needs-team label on Jan 31, 2022
@mikecote added the response-ops-ec-backlog label on Nov 1, 2024