
[Alerting] Provide framework services to enable rule executors to short-circuit execution #113462

Closed
Tracked by #123053
ymao1 opened this issue Sep 29, 2021 · 8 comments · Fixed by #120506
Assignees: ymao1
Labels: estimate:medium (Medium Estimated Level of Effort), Feature:Alerting/RulesFramework (Issues related to the Alerting Rules Framework), Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@ymao1
Contributor

ymao1 commented Sep 29, 2021

As part of this investigation, we have identified improvements in the way the alerting framework can respond to being cancelled. With this issue, we have taken the first step of logging the cancellation in the event log and in the rule execution status. We now want to provide helper functions, passed into the rule executor, that rule type producers can use to short-circuit execution on cancellation.

POC here

We want to provide two helpers:

  1. cancellableEsSearch() - This service function provides a wrapper around the Elasticsearch client search() function that calls the built-in abort method to cancel any in-progress searches if a task is cancelled mid-query. If this occurs, a RequestAbortedError is thrown and rule execution ends in an error. Ideally, if this service function is provided, rule executors will always use it instead of calling the Elasticsearch client search() directly. We can also consider wrapping the RequestAbortedError in a custom error type to provide more granularity.

  2. executionIsCancelled() - This service function provides a way for rule executors to check whether the task has been cancelled. It returns a simple boolean. This would handle the scenario where the ES query returns fairly quickly but post-processing of the query results takes a significant amount of time. Ideally, if this service function is provided, rule executors will periodically check the task cancellation status to determine whether to proceed. The framework should provide a custom error type to be thrown in these scenarios (again, to be used at the rule type author's discretion).

These helpers depend on some sort of cancellation signal that is set when the alerting cancel() function is called. Note that in the POC this signal is an Observable, but whoever implements this issue can decide whether to use Observables, Promises, or something else.
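For illustration only, here is a minimal sketch of what the two services might look like if an AbortController were used as the cancellation signal instead of the POC's Observable. The factory name, error type, and imports are assumptions, not the framework's actual API:

```ts
import { Client } from '@elastic/elasticsearch';
import type { estypes } from '@elastic/elasticsearch';

// Hypothetical error type the framework could expose for cancelled executions.
export class RuleExecutionCancelledError extends Error {
  constructor(message = 'rule execution cancelled') {
    super(message);
    this.name = 'RuleExecutionCancelledError';
  }
}

// Hypothetical factory: the framework would abort this controller from its
// cancel() handler and pass the returned services into the rule executor.
export function createCancellationServices(esClient: Client, controller: AbortController) {
  return {
    // Wraps the ES client search() so an in-flight query is aborted when the
    // task is cancelled mid-query (the client then rejects with RequestAbortedError).
    cancellableEsSearch: <TDocument = unknown>(request: estypes.SearchRequest) =>
      esClient.search<TDocument>(request, { signal: controller.signal }),

    // Lets executors poll the cancellation status between processing steps.
    executionIsCancelled: (): boolean => controller.signal.aborted,
  };
}
```

Whether the RequestAbortedError gets wrapped in something like RuleExecutionCancelledError is the granularity question mentioned above.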

@ymao1 ymao1 added blocked Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework estimate:medium Medium Estimated Level of Effort labels Sep 29, 2021
@ymao1
Contributor Author

ymao1 commented Sep 29, 2021

Blocked by #113459

@mikecote
Contributor

mikecote commented Nov 24, 2021

Regarding cancellableEsSearch, I think we should name it search and make it the go-to way to do queries (and move away from the ES client search function). Eventually, every rule type executor must abort its searches as soon as the rule times out and not perform further operations afterwards. This approach would solve that for searches as a starting point.

Also, I've seen scenarios where rule executions call this function thousands of times and create instability in Kibana (blocking the event loop). With this search() approach, we could, if necessary, explore capping the number of search queries a rule type execution can perform in the future.
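Just to sketch what that capping could look like (the cap value, names, and error are all hypothetical), it could live in the same search() wrapper that carries the cancellation signal:

```ts
import { Client } from '@elastic/elasticsearch';
import type { estypes } from '@elastic/elasticsearch';

// Hypothetical per-execution limit; the real value (and whether to cap at all)
// would need its own discussion.
const MAX_SEARCHES_PER_EXECUTION = 1000;

export function createSearchService(esClient: Client, controller: AbortController) {
  let searchCount = 0;
  return async (request: estypes.SearchRequest) => {
    // Count queries so a runaway executor can be stopped before it destabilizes Kibana.
    if (++searchCount > MAX_SEARCHES_PER_EXECUTION) {
      throw new Error(`rule execution exceeded ${MAX_SEARCHES_PER_EXECUTION} search queries`);
    }
    // Honour the cancellation signal on every query.
    return esClient.search(request, { signal: controller.signal });
  };
}
```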

@pauldotpower
Contributor

+1 for driving rules through a Kibana search client (a thin layer over the ES client) that lets us instrument, rate limit, and cancel queries and add context to each request. Today we have little, if any, visibility into what rules are doing with ES. Ideally it stays just a thin layer over the ES client and we don't become a bottleneck.

On executionIsCancelled: we trust rule implementers to call it within resource loops for it to be effective. If they don't call it often enough, or don't call it at all, we can still end up with a long-running/CPU-intensive rule and the associated impact. Do we have any way of forcefully interrupting or terminating a rule without the rule implementer being involved?
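To make the "call it within resource loops" expectation concrete, a hypothetical executor loop could look like this (handleAlertForHit is a placeholder for rule-specific work):

```ts
// Placeholder for whatever the rule type does with each hit.
declare function handleAlertForHit(hit: unknown): Promise<void>;

export async function processSearchHits(
  hits: unknown[],
  executionIsCancelled: () => boolean
): Promise<void> {
  for (let i = 0; i < hits.length; i++) {
    // Arbitrary cadence: poll cancellation every 100 hits instead of every iteration.
    if (i % 100 === 0 && executionIsCancelled()) {
      throw new Error('rule execution cancelled during post-processing');
    }
    await handleAlertForHit(hits[i]);
  }
}
```

This is cooperative by nature: if the loop never awaits, the event loop stays blocked and cancellation may never even be signalled, which is exactly the concern raised above.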

@mikecote
Contributor

mikecote commented Nov 25, 2021

> On executionIsCancelled: we trust rule implementers to call it within resource loops for it to be effective. If they don't call it often enough, or don't call it at all, we can still end up with a long-running/CPU-intensive rule and the associated impact. Do we have any way of forcefully interrupting or terminating a rule without the rule implementer being involved?

At this time, we don't have a way to do so. We proposed exploring worker threads as an alternative in a recent RFC but decided not to go that route, given the level of effort it would take before the problem starts improving. I do think that, longer term, we will have to explore something like worker threads, or move away from JavaScript executors, to allow us to guarantee that a rule execution is terminated.

@ymao1 ymao1 removed the blocked label Nov 29, 2021
@ymao1 ymao1 self-assigned this Dec 1, 2021
@ymao1
Contributor Author

ymao1 commented Dec 1, 2021

Should see if we can use the built-in AbortController available in elasticsearch-js 8.0: elastic/elasticsearch-js#1297
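Assuming the 8.0 client accepts an AbortSignal via its per-request options (the node URL and index below are placeholders), usage might look roughly like:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' }); // placeholder node
const controller = new AbortController();

// In-flight request; the client should reject it with a RequestAbortedError
// once the signal is aborted.
const pendingSearch = client.search(
  { index: 'my-index', query: { match_all: {} } },
  { signal: controller.signal }
);

// The alerting framework's cancel() handler would trigger this.
controller.abort();
```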

@pauldotpower
Contributor

I'm aware of the Long Running Task RFC and the JS Cluster Proposal RFC; if there's anything on the worker thread investigation, let me know.

If a worker-thread-like solution is out of scope here, then I'd just ask that we consider whether we would do anything differently if we expected to have it in place later.

From my perspective, neither clustering nor cooperative cancellation resolves the challenge of event-loop-blocking tasks running in the same process as the task manager.

  • Clustering is resource-expensive and just spreads the issue out so that it's less obvious, but it still exists.
  • Cooperative cancellation requires rule owners to inject cancellation checks at the appropriate points in each rule to be effective, and it doesn't deal with event-loop-blocking tasks.

An ideal solution would do both: ask a task to cancel itself, then terminate it only if that fails. So the above adds value, but it feels like we're starting with the less impactful approach.

@ymao1
Contributor Author

ymao1 commented Dec 2, 2021

As far as I know, the conventional way to cancel an async function is to pass some sort of abort signal to the function and have the function add a handler that listens for the signal and throws an error accordingly. This issue at least covers the first step of that (passing in an abort signal). While we're leaving the rest up to the rule type executors for now, we could definitely consider taking the next step: adding handlers to existing rule executors ourselves, with some sort of linting strategy to ensure that future rule executors implement the handler as well.

Are there other methods of terminating an async function (outside of worker threads) that we could/should be investigating?
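A rough sketch of that conventional pattern (names are illustrative, not a proposed API): wrap a unit of executor work so that aborting the signal rejects the promise instead of waiting for the work to finish.

```ts
export function abortable<T>(work: Promise<T>, signal: AbortSignal): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const onAbort = () => reject(new Error('rule execution cancelled'));
    if (signal.aborted) {
      onAbort();
      return;
    }
    // Reject as soon as the framework aborts the signal.
    signal.addEventListener('abort', onAbort, { once: true });
    // Otherwise settle with the work's own result, and clean up the listener.
    work.then(resolve, reject).finally(() => signal.removeEventListener('abort', onAbort));
  });
}
```

Note that this only stops waiting on the work; the underlying operation keeps running unless it also honours the signal, which is why worker threads keep coming up as the only way to truly terminate execution.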

@mikecote
Contributor

mikecote commented Dec 2, 2021

It's also important to note that we have solution teams who have built entire products within these rule executor functions, beyond running a single query and creating alerts. Terminating the execution abruptly could cause integrity issues depending on what the executor was doing when it was terminated. For example, alerts could be half-written to an index within the rule executor at the moment execution is terminated, while the framework never gets a chance to send any notifications. We'll have to consider these scenarios, or simplify what rule executors do, as we develop a catch-all approach to terminating executions.
