Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Task Manager] Adds a reschedule api to Task Manager #50718

Closed

Conversation

gmmorris
Copy link
Contributor

@gmmorris gmmorris commented Nov 14, 2019

Summary

The is no way, at the moment, to update a task's scheduling.
This is a problem for Alerting, as it means you can't reschedule when the recurring check for an alert will take place.

closes #45152

This is issue is made worse by the fact that Alerting doesn't use Task Manager's interval field and implement its own, which means a new interval only takes affect after the next run of an alert.
Once this PR is merged, we can pick up #46001 which will change alerting to use this new api in order to address this additional issue.

Checklist

Use strikethroughs to remove checklist items you don't feel are applicable to this PR.

For maintainers

@elasticmachine
Copy link
Contributor

💚 Build Succeeded

@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-stack-services (Team:Stack Services)

@elasticmachine
Copy link
Contributor

💚 Build Succeeded

@elasticmachine
Copy link
Contributor

💚 Build Succeeded

@elasticmachine
Copy link
Contributor

💚 Build Succeeded

@elasticmachine
Copy link
Contributor

💚 Build Succeeded

@elasticmachine
Copy link
Contributor

💔 Build Failed

@elasticmachine
Copy link
Contributor

💚 Build Succeeded

@@ -146,7 +146,7 @@ describe('TaskManager', () => {
expect(result.id).toEqual('my-foo-id');
});

test('doesnt ignore failure to scheduling existing tasks for reasons other than already being scheduled', async () => {
test('doesnt ignore failure to schedule existing tasks for reasons other than already being scheduled', async () => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is unchanged, just its label.

@gmmorris gmmorris changed the title [DRAFT] Adds a reschedule api to Task Manager [Task Manager] Adds a reschedule api to Task Manager Nov 20, 2019
@gmmorris gmmorris marked this pull request as ready for review November 20, 2019 10:25
@gmmorris gmmorris requested a review from a team November 20, 2019 10:25
@elasticmachine
Copy link
Contributor

💚 Build Succeeded

@elasticmachine
Copy link
Contributor

💚 Build Succeeded

Copy link
Contributor

@mikecote mikecote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes are looking good! Just a few comments / nits.

x-pack/legacy/plugins/task_manager/task.ts Outdated Show resolved Hide resolved
taskInstanceScheduling: TaskInstanceScheduling
): Promise<ConcreteTaskInstance> {
const taskInstance = await this.getTask(taskInstanceScheduling.id);
return await this.update(
Copy link
Contributor

@mikecote mikecote Nov 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking this section may become a bit more tricky if we want to further reduce the possibility of getting 409 errors. We should evaluate if it's worth worrying about but I'll write some thoughts below.

Some tasks can change state very often and cause such errors. For example, it's possible a task could be updated from elsewhere between the getTask call and update call. Some of those changes include:

  • Task moving from idle to claiming
  • Task moving from claiming to running
  • Task moving from running to idle

If we think this is worth pursuing, some options may include:

  • Updating only the interval attribute in a separate call without the version attribute (guaranteed no 409s Reduction in the possibility of 409s, leaving retry_on_conflict as the next solution if we get there)
  • Re-attempting a save if ever a 409 has been received (though this could be a repeated error, small odds)
  • Some other logic about trying to update the runAt

Hopefully if we change things, the implementations are simple.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit: I removed the guaranteed no 409s section from my comment as this is not true. They would only happen if ever at a low level a conflict is detected (since Lucene operation is a read then write). retry_on_conflict would be the next way around those.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is what I'm still working on now...
At the moment my blocker is I can't setup a reliable functional test that mimics this... so, it's on going.

I'm not yet sure what the ideal approach would be, but that does raise that we can't skip updating the runAt as that means we might end up with the same problem we have now which is that interval won't be applied until the next runAt expires if get gave us a task with running than then became idle by the time update happened.

This is why my instinct is actually a retry where we can reevaluate which fields to update. Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've introduced a retry on version conflict - but it'll only retry twice, then it'll give up and bubble up the error.

What do you think?

Copy link
Contributor

@mikecote mikecote Nov 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes you've done will definitely help reduce the odds of the task not getting an updated interval. I guess if ever it fails to attempt twice, the reschedule function throws an error? Which I think can be used to notify the user of the very small chance something went wrong and to try their request again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly, if it fails after the second attempt the reschedule call will throw an error.
Whoever is using it will have to handle it.

In the case of the Alerting api that should bubble up to the api reply, but I'll double check.

x-pack/legacy/plugins/task_manager/task_store.ts Outdated Show resolved Hide resolved
Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code LGTM

made some notes on cleanliness / completeness (reschedule result during running | failed), could be done later if we want to do them ...

If a request is made to `reschedule` the `runAt` field of an `idle` task, irrespective of whether the `interval` field is also specified, the task's field(s) will be updated and the task will only run when the newly specified `runAt` expires (after which, any new `interval` that may also have been specified, will be applied to the _next_ scheduled run).
These behaviors mirrors how `schedule` treats these two fields.

Where this behavior diverges is if a request is made to `reschedule` a non `idle` task, such as a `running` task or a `failed` task.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior seems acceptable; basically the task is running right now, or will likely again soon (failed so retrying soon), so a new runAt calculation probably isn't needed anyway.

However

recommend using the Task returned by the reschedule api to assess whether the fields have been updated as expected

is kind of a cop-out. I guess you'd compare the original runAt (which you would have to get first) to the one returned, and make some kind of decision (what would you even do, reschedule again?). I think it should probably be easier to decipher for the caller. Augment the result with a property indicating the runAt was not changed, because [was running | was failed | ???].

If that sounds right, would be fine to do as a follow-on PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point.
I'll give it some thought... but I'm not sure how we'd augment a result, especially considering none of the other operations have such a thing 🤔I'll see if anything makes sense

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya, no sense holding up the PR noodling on doing something here, open an issue if you think it's appropriate. I can't imagine we will really need this immediately ...

retryAt: (doc.retryAt && doc.retryAt.toISOString()) || null,
runAt: (doc.runAt || new Date()).toISOString(),
status: (doc as ConcreteTaskInstance).status || 'idle',
...mapValues(pick(doc, isPlainObject), objectProp => JSON.stringify(objectProp)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is somewhat fragile, in that if we ever added a new object property to the TM SO, it would end up getting JSON.stringify'd, which we wouldn't want. I guess we'd figure it out, as ES would complain that we're giving it a string when it expected an object - but ... who knows! :-)

Also, the previous code seems to be defaulting a bunch of values that the new code isn't ... the non-null ones seem like maybe they're important? attempts and idle? Oh wait, this is only called by taskInstanceToAttributes() which calls applyConcreteTaskInstanceDefaults() first. Perhaps we should just inline the code here in taskInstanceToAttributes() to make that a little more clear?

Also, not sure we even need the dateProp.toISOString(). JSON.stringify() will turn a Date object into an ISO string. Perhaps it's needed elsewhere as the ISO string in other processing tho, and we do need to do it here. If so, a comment would be appropriate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the fragility - I thought we were stringifying because we cant store objects there - any idea why we're stringifying at all then?

regarding defaulting- it should be the same as before, but I separate them on purpose because setting defaults and serialising are two different things and it's harder to maintain code that does two unrelated things.
The reason I'm not inlining them is that we need applyConcreteTaskInstanceDefaults independently when we're omitting and merging data from partial updates (see the reschedule method).

As for toISOString - that's what we were already doing, so I kept it the same. I could investigate why we use it specifically, but either way changing it is probably for a separate PR, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, is that a SO restriction? You can't have object type properties, with say an enabled: false (to prevent index explosion)? If so, then ya, every object would need to be stringified. If it's not an SO restriction, seems like we shoulda used object enabled=false for these, but then that's water under the bridge, migrating is a possibility, for a future PR :-)

As mentioned, even though it may be fragile if we extend the SO later, I think we'd find out pretty quickly (ES wouldn't index the doc), so not a big deal.

@elasticmachine
Copy link
Contributor

💚 Build Succeeded

@elasticmachine
Copy link
Contributor

💔 Build Failed

@elasticmachine
Copy link
Contributor

💔 Build Failed

@elasticmachine
Copy link
Contributor

💚 Build Succeeded

Copy link
Contributor

@mikecote mikecote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM!

Copy link
Contributor

@mikecote mikecote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just had a thought about another scenario I think we have to handle. Whenever we update the interval of a "running" task, this means task manager holds on to an outdated version in one of the Kibana instances. We'll have to make the other end that marks the task as done able to handle 409 errors.

Just writing some notes that we can chat about tomorrow but I think updating tasks that are running will have a problem.

@gmmorris
Copy link
Contributor Author

Just had a thought about another scenario I think we have to handle. Whenever we update the interval of a "running" task, this means task manager holds on to an outdated version in one of the Kibana instances. We'll have to make the other end that marks the task as done able to handle 409 errors.

Just writing some notes that we can chat about tomorrow but I think updating tasks that are running will have a problem.

oh, that's a good shout.
I'll look into it.

@gmmorris
Copy link
Contributor Author

closed as we don't feel we can support this yet.
Work done from runNow should make this easier in the future

@gmmorris gmmorris closed this Dec 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[DISCUSS] Task manager update API to allow changing a task's interval
4 participants