core: implement system batch scheduler #9160
TODO
Something else to consider here is that every job type has a job detail page in the web UI tailored to that type (e.g., service jobs have deployments, parameterized children have a payload). System batch jobs should also have a tailored job detail page. What features of system batch need to be called out? Before this lands we should coordinate with @DingoEatingFuzz to work through what's needed for that design.
Thanks @shoenig! I like the approach taken here. Looks like the conflicts are minimal.
I have a few questions:
- Do we need to add any special logic for the reschedule block? I assume it doesn't apply to system batch jobs; should we add any validation?
- Does this PR support periodic/parameterized system jobs? If so, it might be nice to add tests or call it out.
- What's the intended behavior for new nodes? When should the system batch job be considered "complete"? I envision use cases where system batch jobs should run on the nodes matching the constraints at submission time but not on new nodes, and where running them on new nodes can be "surprising". This might be an interesting intersection with parameterized/dispatch system batch jobs, if supported.
This PR implements a new "System Batch" scheduler type. Jobs can make use of this new scheduler by setting their type to 'sysbatch'.

As the name implies, sysbatch can be thought of as a hybrid between system and batch jobs: it is for running short-lived jobs intended to run on every compatible node in the cluster.

As with batch jobs, sysbatch jobs can also be periodic and/or parameterized dispatch jobs. A sysbatch job is considered complete when it has been run on all compatible nodes until reaching a terminal state (success or failed on retries).

Feasibility and preemption are governed the same as with system jobs. In this PR, the update stanza is not yet supported. The update stanza is still limited in functionality for the underlying system scheduler, and is not yet useful for sysbatch jobs. Further work in #4740 will improve support for the update stanza and deployments.

Closes #2527
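To make the completion rule above concrete, here is a minimal sketch of it as a predicate. This is not the scheduler code from this PR; `AllocStatus`, `TERMINAL`, and `sysbatch_complete` are hypothetical names, and "failed" is assumed to mean failed after retries.

```python
from enum import Enum


class AllocStatus(Enum):
    # Hypothetical status names for illustration only.
    PENDING = "pending"
    RUNNING = "running"
    COMPLETE = "complete"
    FAILED = "failed"  # assumed to mean: failed after retries


# States in which an allocation will not run any further.
TERMINAL = {AllocStatus.COMPLETE, AllocStatus.FAILED}


def sysbatch_complete(compatible_nodes, alloc_status_by_node):
    """Return True when every compatible node has an allocation in a terminal state.

    A node with no allocation yet counts as not finished.
    """
    return all(
        alloc_status_by_node.get(node) in TERMINAL
        for node in compatible_nodes
    )


statuses = {"n1": AllocStatus.COMPLETE, "n2": AllocStatus.RUNNING}
print(sysbatch_complete(["n1", "n2"], statuses))  # n2 is still running
```

Note that the sketch treats a failed allocation as finished, matching the "success or failed on retries" wording; what happens when no node is compatible is not covered by the description and is left out here.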
Use basic sleeps in busybox images. busybox images are very light, whereas ping has permission complications and may fail for network-related reasons.
LGTM. Merging and will iterate on this.
Also, double checked: periodic/parameterized system batch jobs are supported. System batch jobs only run on the nodes that are ready at the time of scheduling; they don't run on nodes added after the initial scheduling unless an operator triggers a job eval (e.g., by resubmitting the job).
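A rough model of the new-node behavior described above, for anyone skimming the thread. It is only a sketch: `SysbatchJob` and `evaluate` are hypothetical names, and skipping nodes that were already placed on a later eval is an assumption of the sketch, not something confirmed in this PR.

```python
class SysbatchJob:
    """Toy model: placements happen only when an evaluation runs."""

    def __init__(self, name):
        self.name = name
        self.placed_on = set()  # nodes that already received an allocation

    def evaluate(self, ready_nodes):
        """Place on every ready node not yet placed on; return the new placements.

        Assumption: nodes that already ran the job are skipped on re-evaluation.
        """
        new_nodes = set(ready_nodes) - self.placed_on
        self.placed_on |= new_nodes
        return new_nodes


cluster = {"n1", "n2"}
job = SysbatchJob("cleanup")
job.evaluate(cluster)         # initial scheduling: n1 and n2 get the job

cluster.add("n3")             # a node joins later; nothing runs on it yet
print(sorted(job.placed_on))  # still only n1 and n2

job.evaluate(cluster)         # operator triggers a new eval, e.g. by resubmitting
print(sorted(job.placed_on))  # n3 is now included
```

The point of the model is that adding a node does not by itself change `placed_on`; only an explicit `evaluate` call does.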
This PR implements a new "System Batch" scheduler type. Jobs can
make use of this new scheduler by setting their type to 'sysbatch'.
Like the name implies, sysbatch can be thought of as a hybrid between
system and batch jobs - it is for running short lived jobs intended to
run on every compatible node in the cluster.
As with batch jobs, sysbatch jobs can also be periodic and/or parameterized
dispatch jobs. A sysbatch job is considered complete when it has been run
on all compatible nodes until reaching a terminal state (success or failed
on retries).
Feasibility and preemption are governed the same as with system jobs. Because the existing system scheduler does not support deployments, sysbatch jobs cannot yet make use of deployments either. (Described in #4740)
Closes #2527