
[Alerting] Provide framework services to enable rule executors to short-circuit execution #113462

Closed
Tracked by #123053
ymao1 opened this issue Sep 29, 2021 · 8 comments · Fixed by #120506
Assignees: ymao1
Labels: estimate:medium (Medium Estimated Level of Effort), Feature:Alerting/RulesFramework (Issues related to the Alerting Rules Framework), Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@ymao1
Contributor

ymao1 commented Sep 29, 2021

As part of this investigation, we have identified improvements in the way the alerting framework can respond to being cancelled. With this issue, we have taken the first step of logging the cancellation in the event log and in the rule execution status. We now want to provide helper functions, passed into the rule executor, that rule type producers can use to short-circuit execution on cancellation.

POC here

We want to provide two helpers:

  1. cancellableEsSearch() - This service function provides a wrapper around the Elasticsearch client search() function that calls the built-in abort method to cancel any in-progress searches if a task is cancelled mid-query. If this occurs, a RequestAbortedError is thrown and rule execution ends in an error. Ideally, if this service function is provided, rule executors will always use it instead of calling the Elasticsearch client search() directly. We can also consider wrapping the RequestAbortedError in a custom error type to provide more granularity.

  2. executionIsCancelled() - This service function provides a way for rule executors to check whether the task has been cancelled. It returns a simple boolean. This would handle the scenario where the ES query returns fairly quickly but post-processing of the query results takes a significant amount of time. Ideally, if this service function is provided, rule executors will periodically check the task cancellation status to determine whether to proceed. The framework should provide a custom error type to be thrown in these scenarios (again, to be used at the rule type author's discretion).

These helpers depend on some sort of cancellation signal that is set when the alerting cancel() function is called. Note that in the POC this signal is an Observable, but whoever implements this issue can decide whether to use Observables, Promises, or something else.
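For illustration only, here is a minimal sketch of what the two services might look like if an AbortController were used as the cancellation signal instead of the POC's Observable. The factory name, error type, and imports are assumptions, not the framework's actual API:

```ts
import { Client } from '@elastic/elasticsearch';
import type { estypes } from '@elastic/elasticsearch';

// Hypothetical error type the framework could expose for cancelled executions.
export class RuleExecutionCancelledError extends Error {
  constructor(message = 'rule execution cancelled') {
    super(message);
    this.name = 'RuleExecutionCancelledError';
  }
}

// Hypothetical factory: the framework would abort this controller from its
// cancel() handler and pass the returned services into the rule executor.
export function createCancellationServices(esClient: Client, controller: AbortController) {
  return {
    // Wraps the ES client search() so an in-flight query is aborted when the
    // task is cancelled mid-query (the client then rejects with RequestAbortedError).
    cancellableEsSearch: <TDocument = unknown>(request: estypes.SearchRequest) =>
      esClient.search<TDocument>(request, { signal: controller.signal }),

    // Lets executors poll the cancellation status between processing steps.
    executionIsCancelled: (): boolean => controller.signal.aborted,
  };
}
```

Whether the RequestAbortedError gets wrapped in something like RuleExecutionCancelledError is the granularity question mentioned above.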

@ymao1 ymao1 added blocked Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework estimate:medium Medium Estimated Level of Effort labels Sep 29, 2021
@ymao1
Contributor Author

ymao1 commented Sep 29, 2021

Blocked by #113459

@mikecote
Contributor

mikecote commented Nov 24, 2021

Regarding cancellableEsSearch, I think we should name it search and make it the go-to way to do queries (and move away from the ES client search function). Eventually, every rule type executor must abort its searches as soon as the rule times out and not perform further operations afterwards. This approach would solve that for searches as a starting point.

Also, I've seen scenarios where rule executions call this function thousands of times and create instability in Kibana (blocking the event loop). With this search() approach, we could, if necessary, explore capping the number of search queries a rule type execution can perform in the future.
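Just to sketch what that capping could look like (the cap value, names, and error are all hypothetical), it could live in the same search() wrapper that carries the cancellation signal:

```ts
import { Client } from '@elastic/elasticsearch';
import type { estypes } from '@elastic/elasticsearch';

// Hypothetical per-execution limit; the real value (and whether to cap at all)
// would need its own discussion.
const MAX_SEARCHES_PER_EXECUTION = 1000;

export function createSearchService(esClient: Client, controller: AbortController) {
  let searchCount = 0;
  return async (request: estypes.SearchRequest) => {
    // Count queries so a runaway executor can be stopped before it destabilizes Kibana.
    if (++searchCount > MAX_SEARCHES_PER_EXECUTION) {
      throw new Error(`rule execution exceeded ${MAX_SEARCHES_PER_EXECUTION} search queries`);
    }
    // Honour the cancellation signal on every query.
    return esClient.search(request, { signal: controller.signal });
  };
}
```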

@pauldotpower
Contributor

+1 for driving rules through a Kibana search client (a thin layer over the ES client) that lets us instrument, rate limit, and cancel queries and add context to each request. Today we have little, if any, visibility into what rules are doing with ES. Ideally it stays just a thin layer over the ES client and we don't become a bottleneck.

On executionIsCancelled: we trust rule implementers to call it within resource loops for it to be effective. If they don't call it often enough, or don't call it at all, we can still end up with a long-running/CPU-intensive rule and the associated impact. Do we have any way of forcefully interrupting or terminating a rule without the rule implementer being involved?
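To make the "call it within resource loops" expectation concrete, a hypothetical executor loop could look like this (handleAlertForHit is a placeholder for rule-specific work):

```ts
// Placeholder for whatever the rule type does with each hit.
declare function handleAlertForHit(hit: unknown): Promise<void>;

export async function processSearchHits(
  hits: unknown[],
  executionIsCancelled: () => boolean
): Promise<void> {
  for (let i = 0; i < hits.length; i++) {
    // Arbitrary cadence: poll cancellation every 100 hits instead of every iteration.
    if (i % 100 === 0 && executionIsCancelled()) {
      throw new Error('rule execution cancelled during post-processing');
    }
    await handleAlertForHit(hits[i]);
  }
}
```

This is cooperative by nature: if the loop never awaits, the event loop stays blocked and cancellation may never even be signalled, which is exactly the concern raised above.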

@mikecote
Contributor

mikecote commented Nov 25, 2021

> On executionIsCancelled: we trust rule implementers to call it within resource loops for it to be effective. If they don't call it often enough, or don't call it at all, we can still end up with a long-running/CPU-intensive rule and the associated impact. Do we have any way of forcefully interrupting or terminating a rule without the rule implementer being involved?

At this time, we don't have a way to do so. We proposed exploring worker threads as an alternative in a recent RFC but decided not to go that route, given the level of effort it would take before the problem starts improving. I do think that, longer term, we will have to explore something like worker threads, or move away from JavaScript executors, to allow us to guarantee that a rule execution is terminated.

@ymao1 ymao1 removed the blocked label Nov 29, 2021
@ymao1 ymao1 self-assigned this Dec 1, 2021
@ymao1
Contributor Author

ymao1 commented Dec 1, 2021

Should see if we can use the built-in AbortController available in elasticsearch-js 8.0: elastic/elasticsearch-js#1297
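Assuming the 8.0 client accepts an AbortSignal via its per-request options (the node URL and index below are placeholders), usage might look roughly like:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' }); // placeholder node
const controller = new AbortController();

// In-flight request; the client should reject it with a RequestAbortedError
// once the signal is aborted.
const pendingSearch = client.search(
  { index: 'my-index', query: { match_all: {} } },
  { signal: controller.signal }
);

// The alerting framework's cancel() handler would trigger this.
controller.abort();
```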

@pauldotpower
Contributor

I'm aware of the Long Running Task RFC and the JS Cluster Proposal RFC; if there's anything on the worker thread investigation, let me know.

If a worker-thread-like solution is out of scope here, then I'd just ask that we consider whether we would do anything differently if we expected to have it in place later.

From my perspective, neither clustering nor cooperative cancellation resolves the challenge of event-loop-blocking tasks running in the same process as the task manager.

  • Clustering is resource-expensive and just spreads the issue out so that it's less obvious, but it still exists.
  • Cooperative cancellation requires rule owners to inject cancellation checks at the appropriate points in each rule to be effective, and it doesn't deal with event-loop-blocking tasks.

An ideal solution would do both: ask a task to cancel itself, then terminate it only if that fails. So the above adds value, but it feels like we're starting with the less impactful approach.

@ymao1
Contributor Author

ymao1 commented Dec 2, 2021

As far as I know, the conventional way to cancel an async function is to pass some sort of abort signal to the function and have the function add a handler that listens for the signal and throws an error accordingly. This issue at least covers the first step of that (passing in an abort signal). While we're leaving the rest up to the rule type executors for now, we could definitely consider taking the next step: adding handlers to existing rule executors ourselves, with some sort of linting strategy to ensure that future rule executors implement the handler as well.

Are there other methods of terminating an async function (outside of worker threads) that we could/should be investigating?
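A rough sketch of that conventional pattern (names are illustrative, not a proposed API): wrap a unit of executor work so that aborting the signal rejects the promise instead of waiting for the work to finish.

```ts
export function abortable<T>(work: Promise<T>, signal: AbortSignal): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const onAbort = () => reject(new Error('rule execution cancelled'));
    if (signal.aborted) {
      onAbort();
      return;
    }
    // Reject as soon as the framework aborts the signal.
    signal.addEventListener('abort', onAbort, { once: true });
    // Otherwise settle with the work's own result, and clean up the listener.
    work.then(resolve, reject).finally(() => signal.removeEventListener('abort', onAbort));
  });
}
```

Note that this only stops waiting on the work; the underlying operation keeps running unless it also honours the signal, which is why worker threads keep coming up as the only way to truly terminate execution.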

@mikecote
Contributor

mikecote commented Dec 2, 2021

It's also important to note that we have solution teams who have built entire products within these rule executor functions, beyond running a single query and creating alerts. Terminating the execution abruptly could cause integrity issues depending on what the executor was doing when it was terminated. For example, alerts could be half-written to an index within the rule executor at the moment execution is terminated, while the framework never gets a chance to send any notifications. We'll have to consider these scenarios, or simplify what rule executors do, as we develop a catch-all approach to terminating executions.
