-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Task Manager] Adds a reschedule
api to Task Manager
#50718
[Task Manager] Adds a reschedule
api to Task Manager
#50718
Conversation
💚 Build Succeeded |
Pinging @elastic/kibana-stack-services (Team:Stack Services) |
💚 Build Succeeded |
💚 Build Succeeded |
💚 Build Succeeded |
…te-scheduled-task
💚 Build Succeeded |
💔 Build Failed |
💚 Build Succeeded |
@@ -146,7 +146,7 @@ describe('TaskManager', () => { | |||
expect(result.id).toEqual('my-foo-id'); | |||
}); | |||
|
|||
test('doesnt ignore failure to scheduling existing tasks for reasons other than already being scheduled', async () => { | |||
test('doesnt ignore failure to schedule existing tasks for reasons other than already being scheduled', async () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test is unchanged, just its label.
reschedule
api to Task Managerreschedule
api to Task Manager
💚 Build Succeeded |
💚 Build Succeeded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes are looking good! Just a few comments / nits.
taskInstanceScheduling: TaskInstanceScheduling | ||
): Promise<ConcreteTaskInstance> { | ||
const taskInstance = await this.getTask(taskInstanceScheduling.id); | ||
return await this.update( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking this section may become a bit more tricky if we want to further reduce the possibility of getting 409 errors. We should evaluate if it's worth worrying about but I'll write some thoughts below.
Some tasks can change state very often and cause such errors. For example, it's possible a task could be updated from elsewhere between the getTask
call and update
call. Some of those changes include:
- Task moving from
idle
toclaiming
- Task moving from
claiming
torunning
- Task moving from
running
toidle
If we think this is worth pursuing, some options may include:
- Updating only the
interval
attribute in a separate call without theversion
attribute (guaranteed no 409sReduction in the possibility of 409s, leavingretry_on_conflict
as the next solution if we get there) - Re-attempting a save if ever a 409 has been received (though this could be a repeated error, small odds)
- Some other logic about trying to update the
runAt
Hopefully if we change things, the implementations are simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edit: I removed the guaranteed no 409s
section from my comment as this is not true. They would only happen if ever at a low level a conflict is detected (since Lucene operation is a read then write). retry_on_conflict
would be the next way around those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is what I'm still working on now...
At the moment my blocker is I can't setup a reliable functional test that mimics this... so, it's on going.
I'm not yet sure what the ideal approach would be, but that does raise that we can't skip updating the runAt
as that means we might end up with the same problem we have now which is that interval
won't be applied until the next runAt
expires if get
gave us a task with running
than then became idle
by the time update
happened.
This is why my instinct is actually a retry where we can reevaluate which fields to update. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've introduced a retry on version conflict - but it'll only retry twice, then it'll give up and bubble up the error.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes you've done will definitely help reduce the odds of the task not getting an updated interval. I guess if ever it fails to attempt twice, the reschedule function throws an error? Which I think can be used to notify the user of the very small chance something went wrong and to try their request again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, exactly, if it fails after the second attempt the reschedule
call will throw an error.
Whoever is using it will have to handle it.
In the case of the Alerting api that should bubble up to the api reply, but I'll double check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code LGTM
made some notes on cleanliness / completeness (reschedule result during running | failed), could be done later if we want to do them ...
If a request is made to `reschedule` the `runAt` field of an `idle` task, irrespective of whether the `interval` field is also specified, the task's field(s) will be updated and the task will only run when the newly specified `runAt` expires (after which, any new `interval` that may also have been specified, will be applied to the _next_ scheduled run). | ||
These behaviors mirrors how `schedule` treats these two fields. | ||
|
||
Where this behavior diverges is if a request is made to `reschedule` a non `idle` task, such as a `running` task or a `failed` task. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behavior seems acceptable; basically the task is running right now, or will likely again soon (failed so retrying soon), so a new runAt
calculation probably isn't needed anyway.
However
recommend using the Task returned by the
reschedule
api to assess whether the fields have been updated as expected
is kind of a cop-out. I guess you'd compare the original runAt
(which you would have to get first) to the one returned, and make some kind of decision (what would you even do, reschedule again?). I think it should probably be easier to decipher for the caller. Augment the result with a property indicating the runAt
was not changed, because [was running
| was failed
| ???].
If that sounds right, would be fine to do as a follow-on PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point.
I'll give it some thought... but I'm not sure how we'd augment a result, especially considering none of the other operations have such a thing 🤔I'll see if anything makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya, no sense holding up the PR noodling on doing something here, open an issue if you think it's appropriate. I can't imagine we will really need this immediately ...
retryAt: (doc.retryAt && doc.retryAt.toISOString()) || null, | ||
runAt: (doc.runAt || new Date()).toISOString(), | ||
status: (doc as ConcreteTaskInstance).status || 'idle', | ||
...mapValues(pick(doc, isPlainObject), objectProp => JSON.stringify(objectProp)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is somewhat fragile, in that if we ever added a new object
property to the TM SO, it would end up getting JSON.stringify
'd, which we wouldn't want. I guess we'd figure it out, as ES would complain that we're giving it a string when it expected an object - but ... who knows! :-)
Also, the previous code seems to be defaulting a bunch of values that the new code isn't ... the non-null
ones seem like maybe they're important? attempts
and idle
? Oh wait, this is only called by taskInstanceToAttributes()
which calls applyConcreteTaskInstanceDefaults()
first. Perhaps we should just inline the code here in taskInstanceToAttributes()
to make that a little more clear?
Also, not sure we even need the dateProp.toISOString()
. JSON.stringify()
will turn a Date object into an ISO string. Perhaps it's needed elsewhere as the ISO string in other processing tho, and we do need to do it here. If so, a comment would be appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the fragility - I thought we were stringifying because we cant store objects there - any idea why we're stringifying at all then?
regarding defaulting- it should be the same as before, but I separate them on purpose because setting defaults and serialising are two different things and it's harder to maintain code that does two unrelated things.
The reason I'm not inlining them is that we need applyConcreteTaskInstanceDefaults
independently when we're omitting and merging data from partial updates (see the reschedule
method).
As for toISOString
- that's what we were already doing, so I kept it the same. I could investigate why we use it specifically, but either way changing it is probably for a separate PR, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, is that a SO restriction? You can't have object
type properties, with say an enabled: false
(to prevent index explosion)? If so, then ya, every object would need to be stringified. If it's not an SO restriction, seems like we shoulda used object enabled=false
for these, but then that's water under the bridge, migrating is a possibility, for a future PR :-)
As mentioned, even though it may be fragile if we extend the SO later, I think we'd find out pretty quickly (ES wouldn't index the doc), so not a big deal.
x-pack/test/plugin_api_integration/test_suites/task_manager/task_manager_integration.js
Show resolved
Hide resolved
💚 Build Succeeded |
💔 Build Failed |
💔 Build Failed |
💚 Build Succeeded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just had a thought about another scenario I think we have to handle. Whenever we update the interval
of a "running" task, this means task manager holds on to an outdated version in one of the Kibana instances. We'll have to make the other end that marks the task as done able to handle 409 errors.
Just writing some notes that we can chat about tomorrow but I think updating tasks that are running will have a problem.
oh, that's a good shout. |
closed as we don't feel we can support this yet. |
Summary
The is no way, at the moment, to update a task's scheduling.
This is a problem for Alerting, as it means you can't reschedule when the recurring check for an alert will take place.
closes #45152
This is issue is made worse by the fact that Alerting doesn't use Task Manager's
interval
field and implement its own, which means a new interval only takes affect after the next run of an alert.Once this PR is merged, we can pick up #46001 which will change alerting to use this new api in order to address this additional issue.
Checklist
Use
strikethroughsto remove checklist items you don't feel are applicable to this PR.This was checked for cross-browser compatibility, including a check against IE11Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n supportThis was checked for keyboard-only and screenreader accessibilityFor maintainers