
[Alerting + Task Manager] Stack Monitoring for Alerting / Task Manager #95197

Closed
gmmorris opened this issue Mar 23, 2021 · 3 comments
Labels
• estimate:needs-research — Estimated as too large and requires research to break down into workable issues
• Feature:Alerting/RulesFramework — Issues related to the Alerting Rules Framework
• Feature:Alerting
• Feature:Task Manager
• impact:medium — Addressing this issue will have a medium level of impact on the quality/strength of our product.
• insight — Issues related to user insight into platform operations and resilience
• resilience — Issues related to Platform resilience in terms of scale, performance & backwards compatibility
• Team:Monitoring — Stack Monitoring team
• Team:ResponseOps — Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@gmmorris
Contributor

There are now public docs for monitoring and troubleshooting Task Manager (and Alerting), but there is no integrated experience for these concerns.

This discuss issue is meant to start the conversation around adding some kind of presence in Stack Monitoring that would help administrators identify issues in Task Manager and Alerting.
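For context, the public docs referenced above describe a health endpoint (`GET /api/task_manager/_health`) that administrators can already poll by hand. A minimal sketch of doing that from a script, assuming Node 18+ for the global `fetch` and placeholder basic-auth credentials:

```ts
// Minimal sketch: poll Kibana's Task Manager health endpoint and flag
// degraded sections. KIBANA_URL and the credentials are placeholders.
const KIBANA_URL = process.env.KIBANA_URL ?? "http://localhost:5601";

// Partial shape of the health response; only the fields used below.
interface TaskManagerHealth {
  status: string; // "OK" | "warn" | "error"
  stats: Record<string, { status: string }>;
}

async function checkTaskManagerHealth(): Promise<void> {
  const res = await fetch(`${KIBANA_URL}/api/task_manager/_health`, {
    headers: {
      Authorization:
        "Basic " + Buffer.from("elastic:changeme").toString("base64"),
    },
  });
  if (!res.ok) {
    throw new Error(`Health request failed with HTTP ${res.status}`);
  }

  const health = (await res.json()) as TaskManagerHealth;
  console.log(`Task Manager status: ${health.status}`);

  // Surface any sub-section (configuration, runtime, workload, ...)
  // that is not reporting OK.
  for (const [section, value] of Object.entries(health.stats)) {
    if (value.status !== "OK") {
      console.warn(`  section "${section}" reports ${value.status}`);
    }
  }
}

checkTaskManagerHealth().catch((err) => {
  console.error(err);
  process.exitCode = 1;
});
```

The gap this issue describes is that none of this surfaces in Stack Monitoring; an administrator has to know the endpoint exists and wire up polling themselves.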

@gmmorris gmmorris added Feature:Alerting Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Mar 23, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote mikecote added the Team:Monitoring Stack Monitoring team label Mar 30, 2021
@elasticmachine
Contributor

Pinging @elastic/stack-monitoring (Team:Monitoring)

@chrisronline
Contributor

I'd like to kick-start this effort by collecting user stories related to monitoring alerting.

For example, a sample story might be:

  • As a user, Task Manager has killed my Kibana instance and I had no insight that Kibana was close to tipping over. I need more insight into the performance of Kibana to know when this will happen.

It feels like we are lacking in some areas here, but we should start from the end user to understand what is missing and how we can address it.

I think our path will lead us to indexing monitoring metrics into the Kibana Stack Monitoring indices, but the first step is defining which metrics we need to index, and understanding the user stories will help define those metrics. (#98625 looks related)

cc @arisonl
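To make the "which metrics do we index" question concrete, here is a purely hypothetical sketch of a per-instance document shape. Every field name below is illustrative, not an agreed schema; the actual fields would need to fall out of the user stories, not the other way around:

```ts
// Hypothetical document shape for Task Manager / Alerting metrics in the
// Stack Monitoring indices. Nothing here is an agreed schema; the fields
// are examples of the kind of data the user stories above would require.
interface KibanaTaskManagerMonitoringDoc {
  timestamp: string; // ISO 8601, e.g. "2021-03-23T12:00:00Z"
  kibana_uuid: string; // which Kibana instance reported the sample
  task_manager: {
    load_percentage: number; // how saturated the task pool is
    drift_p95_ms: number; // how late tasks start vs. their schedule
    failed_executions: number; // task failures since the last sample
  };
  alerting: {
    active_rules: number; // rules currently scheduled on this instance
    execution_duration_p95_ms: number; // rule execution latency
  };
}
```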

@gmmorris gmmorris added the Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework label Jul 1, 2021
@gmmorris gmmorris added the loe:needs-research This issue requires some research before it can be worked on or estimated label Jul 15, 2021
@gmmorris gmmorris added resilience Issues related to Platform resilience in terms of scale, performance & backwards compatibility insight Issues related to user insight into platform operations and resilience estimate:needs-research Estimated as too large and requires research to break down into workable issues labels Aug 13, 2021
@gmmorris gmmorris removed the loe:needs-research This issue requires some research before it can be worked on or estimated label Sep 2, 2021
@gmmorris gmmorris added the impact:medium Addressing this issue will have a medium level of impact on the quality/strength of our product. label Sep 16, 2021
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022