[ResponseOps][Task Manager][mget Claimer] have claimer retry search if no eligible tasks result from mget/update #184940

Open
pmuellr opened this issue Jun 6, 2024 · 3 comments
Labels
Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@pmuellr
Member

pmuellr commented Jun 6, 2024

In PR #180485 (implement task claiming strategy mget) we implemented an alternative task claiming strategy, but it has the following problem:

The original task search returns candidate tasks, which may be skipped if the mget determines the task doc was updated, or the bulk update indicates a conflict. In the worst case, this can result in the task claimer returning no tasks, even if there are tasks available to run.

Suggest we retry the entire claim phase, starting with the search, when we determine that:

  • there are tasks that could be matched (based on the search response; e.g., hits.total indicates more tasks are available)
  • no tasks were matched in this claim cycle

I think there's likely a question if we want to change the test from "no tasks found" to "not many tasks found". For instance, if there are outstanding tasks to run, but we filtered out all but 1 because of conflicts, we probably want to try for N-1 (where N is the number of tasks requested). Kinda thing. Not sure what a good number would be though.
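As a rough illustration of the retry test discussed above, here is a minimal sketch. The `ClaimCycleResult` shape, the `shouldRetryClaim` function, and the `minClaimRatio` parameter are all hypothetical names invented for this example, not Task Manager's actual API. With the default ratio of 0 it implements the "no tasks found" test; a higher ratio approximates the "not many tasks found" variant.

```typescript
// Hypothetical summary of one claim cycle; these names are illustrative only
// and are not taken from Task Manager's code.
interface ClaimCycleResult {
  requested: number; // N: how many tasks this cycle tried to claim
  claimed: number; // tasks actually claimed after mget + bulk update
  totalAvailable: number; // e.g. hits.total from the candidate search
}

// Decide whether to rerun the claim phase, starting with the search.
// minClaimRatio = 0 retries only when nothing was claimed;
// minClaimRatio = 0.5 also retries when fewer than half of N were claimed.
function shouldRetryClaim(result: ClaimCycleResult, minClaimRatio = 0): boolean {
  const { requested, claimed, totalAvailable } = result;
  if (requested <= 0) return false;
  // Nothing beyond what was already claimed is available to run.
  if (totalAvailable <= claimed) return false;
  // The cycle already claimed everything it asked for.
  if (claimed >= requested) return false;
  const threshold = Math.ceil(requested * minClaimRatio);
  return claimed === 0 || claimed < threshold;
}
```

What a good value for the ratio would be is the open question raised above; the sketch only shows where such a knob could sit.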

@pmuellr pmuellr added Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Jun 6, 2024
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@mikecote
Contributor

mikecote commented Jun 6, 2024

I wonder if we want to redo the search or if we want to continue going through the previous candidate tasks 🤔 If we do the search shortly after the previous one, there's a chance the index won't be refreshed yet from the regular 1s interval and it would return the same documents. Maybe we just need to do mget + bulkUpdate a few times, or just continue with bulkUpdate.

@mikecote
Contributor

I tried to do a PoC on this here: #198183

It gets tricky but seems doable. I think the best way to implement this is to loop after the first mget operation and keep trying to claim currentTasks by doing bulk updates. We'd just have to make sure we only try a task once so we don't end up in a retry loop, and maybe limit the number of loops.
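The loop described above could look roughly like the following sketch. It is not the PoC's code: `claimFromCandidates`, `TryClaimBatch`, and `maxLoops` are hypothetical names, and the callback is synchronous only to keep the example short (the real bulk update would be asynchronous). Each candidate is attempted at most once, and the number of loops is capped.

```typescript
// Hypothetical callback standing in for the bulk update: it receives a batch
// of candidate task ids and returns the ids that were successfully claimed.
// Ids missing from the result are treated as conflicts.
type TryClaimBatch = (ids: string[]) => string[];

function claimFromCandidates(
  candidateIds: string[],
  capacity: number,
  tryClaimBatch: TryClaimBatch,
  maxLoops = 3
): string[] {
  const claimed: string[] = [];
  let next = 0; // index of the first candidate that has not been attempted yet

  for (let loop = 0; loop < maxLoops; loop++) {
    const remaining = capacity - claimed.length;
    // Stop once capacity is filled or every candidate has been tried once.
    if (remaining <= 0 || next >= candidateIds.length) break;

    // Only take untried candidates, so a conflicting task is never retried.
    const batch = candidateIds.slice(next, next + remaining);
    next += batch.length;
    claimed.push(...tryClaimBatch(batch));
  }
  return claimed;
}
```

Because the cursor only moves forward, the loop ends after at most `maxLoops` bulk updates even if every attempt conflicts.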

I noticed during some of my performance tests we sometimes have a high number of claim conflicts, perhaps caused by the claiming cycles being synchronized across the Kibana nodes, so it would be valuable to have this in place. It will also add a few ms to the claiming cycle, randomly offsetting it at the same time (a feature we turned off for mget, but could be a last resort).
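To make the random-offset idea concrete, here is a small sketch of adding jitter to a poll interval so that nodes drift out of lockstep. It is only an illustration of the general technique, not how Task Manager implements (or implemented) its offset; `jitteredDelayMs` and its parameters are hypothetical.

```typescript
// Returns the delay before the next claim cycle: the base interval plus a
// random offset in [0, maxOffsetMs]. The random source is injectable so the
// function can be tested deterministically.
function jitteredDelayMs(
  pollIntervalMs: number,
  maxOffsetMs: number,
  random: () => number = Math.random
): number {
  // random() is in [0, 1), so the offset is an integer in [0, maxOffsetMs].
  const offset = Math.floor(random() * (maxOffsetMs + 1));
  return pollIntervalMs + offset;
}
```

Nodes that each draw their own offset are less likely to run the search, mget, and bulk update at the same instant, which is the synchronization suspected above.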
