-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Alerting] Provide framework services to enable rule executors to short-circuit execution #113462
Comments
Blocked by #113459 |
Regarding Also, I've seen scenarios where rule executions would call this function thousands of times and create instability in Kibana (blocking the event loop). With this |
+1 for driving rules through a kibana search client ( a think layer over the ES client) that we can instrument, rate limit, cancel and add context to each query request. Today we have little if any visibility into what rules are doing with ES. Ideally it's just a thin layer over the ES client and we don't become a bottleneck. On |
At this time, we don't have a way to do so. We proposed in a recent RFC to explore worker threads as an alternative but have decided not to go that approach given the level of effort it would take before the problem starts improving. I do think longer term we will have to explore something like worker threads or move away from JavaScript executors to allow us to guarantee the rule execution is terminated. |
Should see if we can use the built in Abort Controller available in elasticsearch js 8.0: elastic/elasticsearch-js#1297 |
I'm aware of the Long Running Task RFC and JS Cluster Proposal RFC - if there's anything on worker thread investigation let me know. If a worker thread like solution is out of scope here then I'd just ask that we consider whether we would do anything differently if expected to have it in place later. From my perspective neither clustering nor cooperative cancelation resolve the challenge of event loop blocking tasks running in the same process as the task manager.
An ideal solution would do both - ask a task to cancel itself and then terminate it only if that fails - so the above adds value but it feels like we're starting with the less impactful approach. |
As far as I know, the conventional way to cancel an async function is to pass some sort of abort signal to the function and have the function add a handler to listen for the signal and throw an error accordingly. This issue at least covers the first step of that (passing in an abort signal) and while we're leaving it up to the rule type executors for now, we could definitely consider taking the next step and adding in handlers to existing rule executors ourselves with some sort of linting strategy to ensure that future rule executors implement the handler as well. Are there other methods of terminating an async function (outside of worker threads) that we could/should be investigating? |
It's also important to note that we have solution teams who built entire products within these rule executor functions beyond running a single query / creating alerts. And terminating the execution abruptly could cause some integrity issues depending on what the executor was doing as it terminated. For example, the alerts could be half-written to an index within the rule executor during the execution's termination, but the framework did not get a chance to send any notifications. We'll have to consider these scenarios or simplify what rule executors do as we develop a catch-all approach to terminate the executions. |
As part of this investigation, we have identified improvements in the way the alerting framework can respond to being cancelled. With this issue, we have taken the first steps to logging the cancellation in the event log and in the rule execution status. We now want to provide some helper functions to be passed into the rule executor that rule type producers can use to short-circuit execution on cancellation.
POC here
We want to provide 2 helpers:
cancellableEsSearch()
- This service function provides a wrapper around the Elasticsearch client search() function that calls the built in abort method to cancel any in progress searches if a task is a cancelled mid-query. If this occurs, a RequestAbortedError is thrown and rule execution will end in an error. Ideally if this service function is provided, rule executors will always use this instead of directly using the Elasticsearch client search(). We can also consider wrapping the RequestAborted error in a custom error type to provide more granularity.executionIsCancelled()
- This service function provides a way for rule executors to check whether the task has been cancelled. It returns a simple boolean. This would handle the scenario where the ES query returns fairly quickly but the post-processing of the query results takes a significant amount of time. Ideally, if this service function is provided, rule executors will periodically check task cancellation status to determine whether or not to proceed. The framework should provide a custom error type to be thrown in these scenarios (again, to be used at the rule type author's discretion).These helpers depend on some sort of cancellation signal that is set when the alerting
cancel()
function is called. Note that in the POC, this signal is an Observable but whoever implements this issue can decide whether to use Observables or Promises or something else.The text was updated successfully, but these errors were encountered: