[ResponseOps][Task Manager][mget Claimer] have claimer retry search if no eligible tasks result from mget/update #184940

pmuellr · 2024-06-06T14:56:46Z

In PR #180485 ("implement task claiming strategy mget") we implemented an alternative task claiming strategy, but it has the following problem:

The original task search returns candidate tasks, which may be skipped if the mget determines the task doc was updated, or the bulk update indicates a conflict. In the worst case, this can result in the task claimer returning no tasks, even if there are tasks available to run.

Suggest we retry the entire claim phase, starting with the search, when we determine that:

there are tasks that could be matched (based on the search response; e.g., hits.total indicates more tasks are available)
no tasks were matched in this claim cycle

I think there's likely a question if we want to change the test from "no tasks found" to "not many tasks found". For instance, if there are outstanding tasks to run, but we filtered out all but 1 because of conflicts, we probably want to try for N-1 (where N is the number of tasks requested). Kinda thing. Not sure what a good number would be though.
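As a rough illustration of the retry test described above, here is a minimal sketch. The `ClaimCycleResult` shape, the `shouldRetryClaim` name, and the `minClaimed` threshold are all hypothetical and not taken from Task Manager; the default threshold of 1 corresponds to the "no tasks found" test, and a higher value approximates "not many tasks found".

```typescript
// Hypothetical summary of one claim cycle; none of these names exist in Task Manager.
interface ClaimCycleResult {
  requested: number; // N: how many tasks this cycle asked for
  claimed: number; // tasks that survived the mget check and the bulk update
  totalAvailable: number; // e.g. hits.total from the candidate search
}

// Returns true when the whole claim phase (starting with the search) should be retried.
// minClaimed is an assumed tuning knob: 1 means "retry only if nothing was claimed".
function shouldRetryClaim(result: ClaimCycleResult, minClaimed = 1): boolean {
  // Only retry if the search says there is more work than we managed to claim.
  const moreAvailable = result.totalAvailable > result.claimed;
  // Never expect more claimed tasks than were requested in the first place.
  const threshold = Math.min(minClaimed, result.requested);
  return moreAvailable && result.claimed < threshold;
}
```

With the default threshold, a cycle that claimed even one task is not retried; passing a larger `minClaimed` makes the claimer retry when most candidates were lost to conflicts. What a good value would be is still an open question.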


elasticmachine · 2024-06-06T14:56:48Z

Pinging @elastic/response-ops (Team:ResponseOps)

mikecote · 2024-06-06T16:49:40Z

I wonder if we want to redo the search or if we want to continue going through the previous candidate tasks 🤔 If we do the search shortly after the previous one, there's a chance the index won't be refreshed yet from the regular 1s interval and it would return the same documents. Maybe we just need to do mget + bulkUpdate a few times, or just continue with bulkUpdate.

mikecote · 2024-10-30T11:17:00Z

I tried to do a PoC on this here: #198183

It gets tricky but seems doable. I think the best way to have this implemented is looping after the first mget operation and keep trying to claim currentTasks by doing bulk updates. We'd just have to make sure we only try a task once so we don't end up in a retry loop, and maybe limit the number of loops.
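A minimal sketch of that loop, under some assumptions: the search returned more candidates than there is capacity, and the bulk update is hidden behind a hypothetical `bulkClaim` callback that resolves to the ids it claimed without a conflict. `claimWithBoundedLoops`, `BulkClaim`, and `maxLoops` are illustrative names, not the PoC's actual code.

```typescript
type TaskId = string;

// Hypothetical stand-in for the bulk update: resolves to the ids that were
// claimed successfully (i.e. without a version conflict).
type BulkClaim = (ids: TaskId[]) => Promise<TaskId[]>;

async function claimWithBoundedLoops(
  candidates: TaskId[], // candidates left after the first mget
  capacity: number, // how many tasks we want to claim this cycle
  bulkClaim: BulkClaim,
  maxLoops = 3 // assumed cap on the number of bulk update rounds
): Promise<TaskId[]> {
  const claimed: TaskId[] = [];
  // Everything before the cursor has been attempted exactly once,
  // so a conflicting task is never retried.
  let cursor = 0;

  for (let loop = 0; loop < maxLoops; loop++) {
    const needed = capacity - claimed.length;
    if (needed <= 0 || cursor >= candidates.length) break;

    // Take only as many untried candidates as we still have capacity for.
    const batch = candidates.slice(cursor, cursor + needed);
    cursor += batch.length;

    claimed.push(...(await bulkClaim(batch)));
  }
  return claimed;
}
```

Advancing a cursor through the candidate list is one simple way to guarantee each task is tried only once; the loop ends when capacity is filled, the candidates run out, or `maxLoops` is reached.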

I noticed during some of my performance tests we sometimes have a high number of claim conflicts, perhaps caused by the claiming cycles being synchronized across the Kibana nodes, so it would be valuable to have this in place. It will also add a few ms to the claiming cycle, randomly offsetting it at the same time (a feature we turned off for mget, but could be a last resort).

pmuellr added the Feature:Task Manager and Team:ResponseOps labels on Jun 6, 2024

pmuellr mentioned this issue Jun 6, 2024

[ResponseOps] implement task claiming strategy mget #180485

Merged

mikecote mentioned this issue Aug 7, 2024

Scaling the alerting throughput ceiling from 3,200 to 32,000+ rules per minute #188194

Open

48 tasks
