Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert alerts to use task manager intervals #46001

Closed
mikecote opened this issue Sep 18, 2019 · 19 comments · Fixed by #80149
Closed

Convert alerts to use task manager intervals #46001

mikecote opened this issue Sep 18, 2019 · 19 comments · Fixed by #80149
Assignees
Labels
Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@mikecote
Copy link
Contributor

mikecote commented Sep 18, 2019

This portion will require #45152. Once completed, alerting should use the task manager's interval property as well as the update API whenever the interval changes.

@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-stack-services

@mikecote mikecote self-assigned this Sep 18, 2019
@mikecote
Copy link
Contributor Author

This will require an update API in task manager in order to update interval value (causing a re-schedule). This is getting more similar to #45152 the more I dig into it. Both issues could be resolved in 1-2 PRs. cc @gmmorris.

@gmmorris gmmorris self-assigned this Sep 18, 2019
@mikecote
Copy link
Contributor Author

Updating issue: This portion will require #45152. Once completed, alerting should use the task manager's interval property as well as the update API whenever the interval changes.

@gmmorris
Copy link
Contributor

gmmorris commented Nov 20, 2019

I've began investigating the changes needed in alerting and a couple of questions spring to mind:

  1. Do we still want to store interval on Alert SavedObjects? If interval is used to schedule the Task, then, in theory, we have no need to store the interval on the Alert, but before I clean off that field I want to make sure this is the case. UPDATE: I have now realised we do need it, as enabling/disabling an Alert deletes the Task, but you don't need to specify the interval when enabling, so we'd need to store the interval somewhere as a source of truth.
  2. How do we reschedule existing Alerts once we've made this change? This is a unique migration as it's not a Alert -> Alert migration, but rather, we want to migrate an Alerting Task based on a version change in such a way that info that currently resides on the Alert is used to migrate the Task. As far as I'm aware, migrations can't currently do this.
    Any ideas?

@mikecote
Copy link
Contributor Author

This is a unique migration as it's not a Alert -> Alert migration, but rather, we want to migrate an Alerting Task based on a version change in such a way that info that currently resides on the Alert is used to migrate the Task.

Luckily we don't need to support migrations yet (ex: 7.5 -> 7.6) as we currently don't support existing alerts in pre 7.6 releases.

@gmmorris
Copy link
Contributor

Ahh so we're fine with just breaking existing alerts when moving form 7.5 to 7.6?
I guess, as we're not even in beta yet that is reasonable.... lucky 😬

@mikecote
Copy link
Contributor Author

@gmmorris I think we can close this one now that it is no longer relevant and the updated work is tracked in #51575?

@bmcconaghy bmcconaghy added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) and removed Team:Stack Services labels Dec 12, 2019
@gmmorris
Copy link
Contributor

@mikecote I don't think we should close this, as long term, I don't think we want to maintain two separate interval implementations, but I no longer think this is higher priority than the migration to the Kibana Platform.

Shall we move this to 7.7 bellow migration to the platform?

@mikecote
Copy link
Contributor Author

@gmmorris good point, we should re-triage this issue at the next sync. I guess it can be 7.7 or back into "code debt" column.

@gmmorris
Copy link
Contributor

This has come up again in #63188 (comment) so we should probably reprioritise this again.

Looking into the above issue we've realised that Alerting needs quite a bit of what Task Manager offers out of the box in terms of retries for scheduled tasks, but it isn't a 1-to-1 fit and we'll likely need to make some changes to TM to properly support Alertings's needs.

Whoever picks up this issue, please read through the comments here: #63188 (comment)

In addition, something worth noting:
https://github.com/elastic/kibana/blob/46901f557b11e3e53256c44534e579712d2e18e9/x-pack/plugins/task_manager/server/task_runner.ts#L237-245

In the above section you'll note that Task Manager uses a different path when a task has a schedule on it, which doesn't currently apply to Alerts, and we want it to. That said - the implementation is problematic for us as it uses now + timeout as the new schedule, when in fact, that might be completely wrong as it might be way too late for an interval of 5s but also way too soon in certain scenarios.

This will have to be rethought, and is likely wrong from TM as a whole, not just Alerting.
I'm not sure what the right solution is, but this seems the wrong scheduling for a timeout. 🤔

@mikecote
Copy link
Contributor Author

Adding a note that we should have a discussion on this issue before starting any development. Just to review what level of effort will be required, what approach and alternatives we have, what problem(s) we're solving, etc.

@gmmorris
Copy link
Contributor

The work on TM observability (#77456) has brought to my attention that we can't visualize the "stress" that alerting puts on TM until this issue is done.
As TM has no understanding of Alert tasks as being scheduled tasks (because they reschedule themselves), there is no way for TM to know the cadence of Alerting tasks which means the observability info there isn't quite as useful.

Luckily this issue is prioritised, so we'll have this pre GA, but I wanted to mention this so that we understand just how valuable this issue is. 👍

@gmmorris
Copy link
Contributor

Also worth noting - TM doesn't support Day / Hour scheduling, only Seconds and Minutes, so we'll need to add that to TM as part of this work.

@mikecote
Copy link
Contributor Author

mikecote commented Oct 1, 2020

I think some of the early issues attempting to do this was OCC issues when a task was running and would / could revert the updated schedule. There's probably some lessons learned from updating alerts w/ OCC logic that could be applied here (cc @pmuellr)

@mikecote
Copy link
Contributor Author

mikecote commented Oct 1, 2020

This issue should solve test 1 and 3 from #53650 (comment).

  1. Alert dies after 3 consecutive timeouts
  2. Alerts execution that times out will try again 10 minutes later

Otherwise separate issues should be opened for those as I believe they are the driver why this issue is in To-Do for GA.

@gmmorris
Copy link
Contributor

gmmorris commented Oct 1, 2020

I think some of the early issues attempting to do this was OCC issues when a task was running and would / could revert the updated schedule. There's probably some lessons learned from updating alerts w/ OCC logic that could be applied here (cc @pmuellr)

We might be able to side step this by allowing partial updates of certain parts of the task, but there's a danger here that it might still cause a failure of a running task if we don't limit what fields it touches 🤔

@gmmorris
Copy link
Contributor

gmmorris commented Oct 2, 2020

One important note about this issue - if we do not migrate Alerting to use the Task manager intervals, the alerting tasks will not appear to Task Manager as a recurring task, but rather as a single execution task.

This means that some of the observability added by #77456 won't be as useful for diagnosing alerting problems as they would be otherwise.
This is because some of the workload stats we aggregate include information such as interval frequency (to answer the question of "how many tasks are scheduled to run every 1s?). This will not include alerting tasks as they do not have an interval field on them.

We could address this by providing these stats in the Alerting health endpoint, but that's far from ideal.

Another thought is that we could possibly have an interval field on the alerts even if it isn't used and the alert task reschedules itself (the way it does today). Some investigation might be needed to see if that has a negative impact or not. But it's a thought. 🤷

@mikecote
Copy link
Contributor Author

mikecote commented Oct 2, 2020

Side note: this issue should be paired with another person due to the potential complexity it will have.

@gmmorris gmmorris self-assigned this Oct 5, 2020
@gmmorris
Copy link
Contributor

I've been spiking this locally and it seems this isn't nearly as complicated as we originally thought. We can break this down into 2 steps:

  1. Add support for h and d intervals to Task Manager to align with Alerting
  2. Add support for schedule in Task return type. This would allow tasks to return a recurring schedule instead of just a runAt when they return a value. This addresses the update/OCC issues we encountered around this in the past.
  3. Add fresh pull of schedule in case it was updated during a task run.

Spiked here: #80149

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)
Projects
None yet
6 participants