Task Manager v1.0.0 #23632
Comments
I read this out of general interest and it looks great to me. I wrote down some things that I thought were worth sharing while reading.
What does recovery mean exactly? If a Kibana instance runs a task and it just takes too long, I presume it will be killed and marked as problematic.
Heads up - you can reduce the chance of this happening a lot, but you can't prevent it completely. It's important in task design to be aware of it and try to design safe operations. For example - index using predictable ids (rather than an auto-gened one) to prevent duplicates if the index operation is run twice.
This needs to use optimistic locking to make sure that another instance didn't "claim" the task and is running it after timeout.
Do I understand it correctly that every Kibana instance will also poll for tasks that have been marked as running X time ago (or whose runAt is X old) and force claim them? I think that's good, but I would recommend making X 1.5 times the timeout setting, so it gives a live Kibana instance some time to mark a task as timed out when it truly does time out. Last, I would also recommend that each Kibana instance run its periodic check on a randomized offset. For example, if
Yeap. It's vague. Maybe we should just pull that line out of the description. Basically, I was just trying to say that the task manager has some strategy for handling tasks that don't complete in time. It's currently a naive strategy (retry a few times, then stop retrying if it fails too many times).
This is done using optimistic concurrency / locking, and no id generation is used, so I think it should be as safe as we can make it. To be precise, id generation may be used when the task is initially scheduled, but not on any of the reschedules.
I'm not sure. If a task author is conservative, and gives themselves a long timeout, I don't know that we want to multiply it. (e.g. 2 days becomes 3 days, which is an extreme example, but you get the drift). I think authors of tasks that have a risk of timing out should be intentional about the timeout value given.
That's an interesting idea. It's likely to already happen, as Kibana bootup times will not be predictable, but they'll probably be fairly close. The polling will diverge and converge over time, though, so I'm not sure how much we'll gain, but we will make an individual instance a bit harder to test and reason about.
I think I wasn't specific enough. If a task has secondary output, like indexing a document with its result into ES, that document should have an id that is derived from the task inputs. For example, when ML analyzes the data of a certain time slot, the result is stored as a document with an id that is derived from the job id and the time slot. This means that if two instances of the same task run concurrently, only a single document will be created (and overwritten).
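For illustration, a minimal sketch of that idea. The client interface and the names (jobId, timeSlot, index name) are invented for the example and are not from ML or Task Manager:

```ts
// Sketch only: IndexClient stands in for whatever Elasticsearch client the
// task has access to; jobId/timeSlot are illustrative inputs.
interface IndexClient {
  index(args: { index: string; id: string; body: object }): Promise<unknown>;
}

async function storeResult(
  client: IndexClient,
  jobId: string,
  timeSlot: string,
  result: object
) {
  // Deterministic id derived from the task's inputs: if two instances of the
  // same task run concurrently, both write the same document (last write wins)
  // instead of creating duplicates.
  const id = `${jobId}-${timeSlot}`;
  await client.index({ index: 'task-results', id, body: result });
}
```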
You can go with a 50% increase with a maximum of 5 minutes. My concern here is that you need to give some time for the Kibana instance that actually runs the task to process the timeout before another Kibana instance takes over and runs it again.
I'm not sure what you mean exactly by diverge and converge. The exact dynamics depend on your scheduling logic (for example, if you schedule based on rounded wall clock time, you'd have that problem). Just something to think about.
@stacey-gammon in order not to sidetrack a separate issue, I'm going to reply to something you posted on an issue about the saved object API and reporting:
I don't know if this question is tracked as open in some other issue. I'd say from Task Manager's perspective, it's not really a question of bloating the capabilities. We should keep storing the results of Reporting (generated payload, status, etc.), because Reporting will work better with its own index instead of going through an abstraction. Task Manager has its own index to represent scheduled tasks, and there is a string field for accumulated state and function parameters to the run function. Those are there to help Task Manager keep track of things on repeated runs of a recurring task. It would not be practical to store all the data needed for Reporting in that index.
@tsullivan @njd5475 @chrisdavies can we close this now that the beta version has been merged? For additional features and improvements we still have #25271 open.
Closed via #24356
Kibana Task Manager
Overview
We need a generic system for running background tasks in the Kibana server. It should support:
Implementation details
At a high level, task manager will:
- Allow plugins to register task definitions, each providing a `createTaskRunner` method that returns an object with `run` and `cancel` (optional) methods.
- Every `{poll_interval}` milliseconds, check the `{index}` for any task instances that need to be run:
  - `runAt` is past
  - `attempts` is less than the configured threshold
- Claim an eligible task instance by using optimistic concurrency to set its status to `running` and its `runAt` to now + the timeout specified by the task (see the sketch after this list)
- If the task fails, increment the `attempts` count and reschedule it for 5 minutes in the future
- If the task succeeds and is recurring, its run function returns a new `runAt` value. We update the document of the task instance with the new `runAt` for the next iteration of the task.
- If the `run` function doesn't return a new `runAt` value, remove the task instance document from `{index}` and the task will not recur.
- Rescheduling updates the existing task instance document's `runAt` field, keeping the same `_id` (no new document is created).
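As a rough sketch of the claim step (the `TaskDoc`/`TaskStore` shapes below are invented for this example, not the real implementation; they just assume an Elasticsearch-style versioned update):

```ts
interface TaskDoc {
  id: string;
  version: number;   // document version used for optimistic concurrency
  status: string;
  attempts: number;
  runAt: Date;
  timeoutMs: number; // the task definition's timeout, in milliseconds
}

interface TaskStore {
  // Resolves with the updated doc, or rejects with a version conflict if
  // another Kibana instance modified the document first.
  update(doc: TaskDoc, ifVersion: number): Promise<TaskDoc>;
}

async function attemptToClaim(store: TaskStore, task: TaskDoc): Promise<TaskDoc | null> {
  try {
    return await store.update(
      {
        ...task,
        status: 'running',
        runAt: new Date(Date.now() + task.timeoutMs), // now + the task's timeout
      },
      task.version // the update only applies if nobody else touched the doc
    );
  } catch (err) {
    // Version conflict: another instance claimed this task first, so skip it.
    return null;
  }
}
```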
Pooling
The task manager of a Kibana instance runs tasks in a pool which ensures that at most N tasks are run at a time, where N is configurable. This prevents the Kibana instance from running too many tasks at once in resource constrained environments.
In addition to this, task type definitions can configure tasks to run in smaller pools to limit how many tasks of a given type can be run at once.
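A loose illustration of the pooling idea, not the real implementation: at most `maxWorkers` workers are occupied at once, and a task type can occupy more than one worker via its `numWorkers` value.

```ts
type RunnableTask = { numWorkers: number; run: () => Promise<void> };

class TaskPool {
  private occupied = 0;

  constructor(private readonly maxWorkers: number) {}

  // Returns false when there is no capacity; the caller can retry on the next poll.
  attemptToRun(task: RunnableTask): boolean {
    if (this.occupied + task.numWorkers > this.maxWorkers) {
      return false;
    }
    this.occupied += task.numWorkers;
    task
      .run()
      .catch(() => {
        // Failures are handled by the retry logic described above, not the pool.
      })
      .finally(() => {
        this.occupied -= task.numWorkers;
      });
    return true;
  }
}
```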
Config options
The task_manager can be configured via `task_manager` config options (e.g. `task_manager.max_attempts`):

- `max_attempts` - How many times a failing task instance will be retried before it is never run again
- `poll_interval` - How often the background worker should check the task_manager index for more work
- `index` - The name of the index that the task_manager uses to store task instances
- `max_workers` - The maximum number of tasks a single Kibana instance will run concurrently (defaults to 10)
- `override_num_workers` - Object to customize the number of workers occupied by specific tasks, defined by the fields of the object (e.g. `override_num_workers.reporting: 2`)
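For illustration, these options might appear in kibana.yml roughly like this. The values and the index name are made up, and the exact key prefix should be checked against the shipped config:

```yaml
# kibana.yml - illustrative values only
task_manager:
  max_attempts: 3
  poll_interval: 3000          # milliseconds
  index: ".tasks"              # placeholder name
  max_workers: 10
  override_num_workers:
    reporting: 2
```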
Task definitions

Plugins define tasks by registering a task type definition with the task manager service.
When Kibana attempts to claim and run a task instance, it looks its definition up, and executes its run method, passing it a run context which looks like this:
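The example that originally followed this sentence isn't preserved in this copy. The sketch below is a rough guess at the shape, built only from the fields mentioned elsewhere in this document (`timeOut`, `numWorkers`, `params`, `state`); the real definition and context formats may differ.

```ts
// Rough sketch only; field names come from this document, exact shapes may differ.
interface RunContext {
  taskInstance: {
    params: Record<string, unknown>; // arguments provided when the task was scheduled
    state: Record<string, unknown>;  // accumulated state from previous runs
  };
  // ...plus whatever server access the task manager provides
}

const clusterMonitoringTask = {
  type: 'clusterMonitoring',   // hypothetical task type name
  title: 'Cluster monitoring',
  timeOut: '5m',               // see the Timeouts section below
  numWorkers: 1,
  createTaskRunner() {
    return {
      async run(context: RunContext) {
        const { params, state } = context.taskInstance;
        // ...do the actual work using params and state...
        return {
          state: { ...state, lastRunAt: new Date().toISOString() },
          runAt: new Date(Date.now() + 60 * 60 * 1000), // recur in an hour
        };
      },
      // optional: called if the task times out (see Timeouts below)
      async cancel() {},
    };
  },
};
```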
Task result
The task's run method is expected to return a promise that resolves to an object that conforms to the following interface:
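The interface block itself isn't preserved in this copy; the following is a rough sketch of the result shape implied by the surrounding text (`runAt` for rescheduling, `state` for carry-over). The `error` field is an assumption.

```ts
// Sketch of the result shape implied by this document; the real interface
// may have more or different fields.
interface TaskResult {
  // When the task should run next; omit it and the task will not recur.
  runAt?: Date;
  // Small, JSON-serializable state carried into the next run.
  state?: Record<string, unknown>;
  // Assumed: optional error information if the run failed.
  error?: object;
}
```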
State should be relatively small, as it is stored as a string of JSON. Large blobs in state will impact performance of migration and maintenance of the task manager index. If you need to store big values in state, you can instead write those values to your own index, and store the ids of the documents in state.
Timeouts
If the promise returned by the run function has a cancel method, the cancel method will be called if Kibana determines that the task has timed out (by referencing the `timeOut` field of the task definition). The cancel method itself can return a promise, and Kibana will wait for the cancellation before attempting a re-run. Tasks can perform cleanup work here, if needed.
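A loose sketch of that cancellation contract. Note the document mentions cancel both on the runner object and on the returned promise; this sketch uses the runner-object form, and `processBatch` is a hypothetical unit of work:

```ts
function createTaskRunner() {
  let cancelled = false;

  return {
    async run() {
      for (const batch of [1, 2, 3]) {
        if (cancelled) {
          return; // stop early, leaving state consistent
        }
        await processBatch(batch);
      }
    },
    async cancel() {
      // Cancel resolves once cleanup is done; per the text above, Kibana waits
      // for this before attempting a re-run.
      cancelled = true;
    },
  };
}

async function processBatch(n: number) {
  /* placeholder for real work */
}
```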
Task instances

The task_manager module will store scheduled task instances in a configurable index. This allows recovery of failed tasks, coordination across Kibana clusters, etc.
The data for a task instance is stored and passed as context to the `run` function in a way that looks something like this:
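The original example document isn't preserved here; the object below is an illustrative guess assembled from fields discussed elsewhere in this document (note that `params` and `state` are stored as JSON strings, per the Task result section):

```ts
// Illustrative shape of a stored task instance, not a copy of the real mapping.
const exampleTaskInstance = {
  taskType: 'clusterMonitoring',
  runAt: '2018-10-04T12:00:00.000Z', // when the task should next run
  attempts: 0,                        // failed attempts so far
  status: 'idle',                     // e.g. idle | running
  params: '{"threshold":0.9}',        // JSON string of the scheduling arguments
  state: '{"lastRunAt":null}',        // JSON string of accumulated state
};
```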
Programmatic access
The task manager plugin exposes an object in its namespace on the server object, which plugins can use to manage scheduled tasks.
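A hypothetical usage sketch; the actual property and method names exposed on the server object may differ from what is shown here.

```ts
// Hypothetical: assumes a `taskManager` namespace with a `schedule` method.
async function scheduleClusterMonitoring(server: any) {
  await server.taskManager.schedule({
    taskType: 'clusterMonitoring', // must match a registered task definition
    runAt: new Date(),             // run as soon as a worker is free
    params: { threshold: 0.9 },    // illustrative arguments
  });
}
```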
Middleware
Plugins will be able to augment task instance data with `beforeSchedule` hooks, and modify the run context with `beforeRun` hooks.
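A sketch of what such hooks might look like; the hook names come from the sentence above, but the object shapes and the registration call are assumptions.

```ts
const exampleMiddleware = {
  // Augment task instance data before it is written to the task index.
  async beforeSchedule({ taskInstance }: { taskInstance: any }) {
    return {
      taskInstance: {
        ...taskInstance,
        params: { ...taskInstance.params, scheduledBy: 'my_plugin' },
      },
    };
  },
  // Modify the run context just before a task's run method is invoked.
  async beforeRun(runContext: any) {
    return { ...runContext, startedAt: new Date() };
  },
};

// server.taskManager.addMiddleware(exampleMiddleware); // hypothetical registration
```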
Limitations in v1.0.0

- It is not possible to use `callWithRequest` in a run function, as run is not in a request context. Using `callWithInternalUser` is possible. It will also be possible for a plugin to add username/password or other authorization config to kibana.yml and connect to Elasticsearch with those credentials, or use a token to talk to a 3rd party service. Ideally, there will be support in Elasticsearch for generating long-running authentication tokens on behalf of a requesting user.
- The `numWorkers` value for a task definition is a best-guess approach.