Job submission API #268
Just to further the discussion a bit: it looks like the job service module within flux-core will invoke a callback upon receiving a "job.create" request, create a new job id, set the state of this job to "reserved," and then respond to the requester with the new job id. Upon receiving this response, flux-submit will set the state to "submitted," and this state transition (reserved->submitted) should trigger the schedule action. Now, if the job module itself can write the "submitted" state while processing a successful job-creation request, before responding to the requester, this should wake up both flux-sched's scheduler framework service and the requester at essentially the same time. We'll need some error handling, of course. I haven't tested this, but it should allow you to bypass flux-submit and create your own submission tool. Is this along the lines you are thinking about?

I am not sure this is enough to get the throughput you want, though. You may have discussed this a bit at today's meeting, but perhaps one needs a way to batch a large number of job.create requests and submit them to the job module, which then emits a 'bulk submit' event and also responds to the request with the job list (e.g., 4-45, 47-409). Perhaps the current job module could be modified to support such bulk job requests as well.

On the second requirement, I think the job module currently expects a JSON object with the number of nodes and processes, which is piggybacked on a job.create request. Would you like to deal with resource shapes richer than the number of nodes and processes now? And/or would you like a more fine-grained job request creation type beyond this single job.create request? Either sounds reasonable, particularly with actual use cases in hand. But I would be careful with the resource specification design in terms of its extensibility, as our job schema has not yet been defined.
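To make the proposed variant concrete, here is a toy Python model of the state transition described above: the job module reserves an id, then writes "submitted" itself before replying, so the scheduler and the requester are woken at roughly the same time. None of these class or method names come from the actual flux-core API; they are assumptions for illustration only.

```python
import itertools

class JobModule:
    """Toy stand-in for the flux-core job service module (hypothetical)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.state = {}    # jobid -> state string
        self.events = []   # stand-in for events emitted to the scheduler

    def create(self, request):
        """Handle a 'job.create' request.

        In the variant discussed above, the module writes 'submitted'
        before replying, so a client can bypass flux-submit entirely.
        """
        jobid = next(self._ids)
        self.state[jobid] = "reserved"
        # reserved -> submitted happens module-side, not in flux-submit:
        self.state[jobid] = "submitted"
        self.events.append(("state-change", jobid, "reserved->submitted"))
        return {"jobid": jobid}

mod = JobModule()
reply = mod.create({"nnodes": 2, "ntasks": 4})
print(reply)                      # {'jobid': 1}
print(mod.state[reply["jobid"]])  # submitted
```

The point of the sketch is only the ordering: the state write precedes the reply, so both observers see "submitted" at the same logical moment.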
After today's discussion, it seems like what we need here is a protocol for job submission, ideally defining a base protocol and then allowing extensions such as described in #293. As a first cut, we can determine the base protocol, then work on extensions of it later. This also ties into the need for protocol/interface definitions for querying resources and interacting with resource instances across the project. I'll be adding tickets for these others shortly.
The big picture can be broken down into five pieces. The answers to the questions above depend on the meaning of "submit" and the intended outcome. I know we're talking semantics here, but in my mind "submit" implies a queue. When you submit a job to a system, whether for immediate execution or to a batch queue, the assumption is that some processing must occur before your job is allowed to run. This contrasts with "run," which implies a command for immediate execution on specified resources.

If the outcome is to instantiate a Flux program on the requested resources and run it immediately, i.e., "launch this program now," the assumption is that the resources are available and allowed to run a job for an indefinite period of time. Any other outcome falls to a scheduling service that, by design, we have relegated to a service outside of flux-core. At that point, this issue will indeed apply to the flux-sched side and serve to design/enhance "flux submit". All that said, I believe this issue is about just the first and fifth pieces described at the top of this message and does not include the other pieces, which would qualify or schedule the request.

One other point: other schedulers have a notion of a "job array". This covers the case where a number of jobs are submitted and run in bulk. Typically a job quantity is specified, an executable is named, and there is a means to specify the input and output files for each job. If we were to add a bulk ability to our job submit protocol, I would suggest using the term "array" instead of "bulk".
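The job-array idea above can be sketched as a simple expansion: one request carrying a quantity, an executable, and per-job I/O templates becomes N individual job specs. All field names here are placeholders invented for illustration, not part of any existing Flux protocol.

```python
def expand_array(request):
    """Expand a hypothetical 'array' request into individual job specs.

    The input/output fields are templates; '{i}' is replaced by the
    job's index within the array.
    """
    return [
        {
            "command": request["command"],
            "stdin": request["input"].format(i=i),
            "stdout": request["output"].format(i=i),
        }
        for i in range(request["count"])
    ]

jobs = expand_array({
    "count": 3,
    "command": "a.out",
    "input": "in.{i}",
    "output": "out.{i}",
})
print(len(jobs))           # 3
print(jobs[0]["stdout"])   # out.0
```

Whether the expansion happens client-side or in the job module is exactly the kind of choice the protocol spec would need to pin down.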
It also requires the second, to use your numbering scheme. What I'm looking for is a way to generically submit a job to whatever mechanism is available to service that request. At the bare minimum, that may be a blocking queue of length one that does a feasibility check and an availability check, and either blocks or fails when the feasibility check passes but the availability check fails. When the availability check passes, the resources are allocated and the job is run. This is distinct from a "run" service that runs a given set of tasks on a specific set of pre-determined resources.
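The minimal "blocking queue of length one" above can be sketched as follows: infeasible requests fail outright, feasible-but-unavailable requests block (or fail if non-blocking), and available requests allocate and run. This is a toy model under assumed names, not an implementation of any Flux service.

```python
class QueueOfOne:
    """Hypothetical minimal submission service: feasibility check,
    then availability check, then allocate-and-run."""

    def __init__(self, total_nodes, free_nodes):
        self.total = total_nodes   # instance size (feasibility bound)
        self.free = free_nodes     # currently idle nodes (availability)

    def submit(self, nnodes, block=True):
        if nnodes > self.total:
            # Can never run on this instance: fail the submission.
            raise ValueError("infeasible: request exceeds instance size")
        if nnodes > self.free:
            if not block:
                raise RuntimeError("resources busy")
            return "blocked"       # a real service would wait here
        self.free -= nnodes        # allocate and run immediately
        return "running"

q = QueueOfOne(total_nodes=4, free_nodes=2)
print(q.submit(2))   # running
print(q.submit(1))   # blocked
```

As noted in the next comment, this amounts to a simple FIFO scheduler, which is exactly the bare-minimum semantics being asked for.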
Ok, good, thanks for clarifying. It seems to me that what you're describing is a simple FIFO scheduler.
What if we focus this Issue on creating a protocol spec (RFC-style) for the job submission API @trws requires, then decide afterward where to put the result and the utilities that will use it? I don't feel like I have a handle on what the protocol might look like, but it feels like it would be pretty simple for this base case. Maybe we can start with the simplest RPC definition, and refine from there? I have to read through the JSON Content Rules RFC before getting the form correct, but perhaps an off-the-cuff example would get us started?
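As a strawman for the simplest possible RPC pair, something like the following: a request carrying the job shape and command, and a reply carrying the assigned job id and state. Every field name here is a guess made for discussion, not a spec, and both messages are just plain serializable JSON.

```python
import json

# Hypothetical base request: minimal job shape plus the command to run.
request = {
    "topic": "job.submit",
    "payload": {
        "nnodes": 2,
        "ntasks": 4,
        "command": ["a.out", "--flag"],
    },
}

# Hypothetical reply: the id assigned by the job module and its state.
reply = {
    "topic": "job.submit",
    "payload": {
        "jobid": 42,       # placeholder value
        "state": "submitted",
    },
}

# Round-trip through JSON to show both sides are plain wire messages.
wire = json.dumps(request)
assert json.loads(wire) == request
print(sorted(reply["payload"]))   # ['jobid', 'state']
```

A real spec would also have to define error replies and any KVS paths associated with the job, per the follow-up comments below.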
That is pretty basic, but I suppose we have to start somewhere.
That looks like a good start to me, add in a KVS path as you suggest and a KVS root version and I think we'd have the base of it. I'm thinking it would also be good to have a little API library for this, just basic
Ah, and probably adding:

```
topic: "job.features"
payload: N/A

job.features reply:
payload: {
    "option" : "value",
    ...
}
```
While this can be prototyped with the 'flux-submit' tool, to be practical it will require a way to submit jobs directly to the system. If we need to design this, it seems worth designing an interface for job submission: something that different schedulers, or perhaps even a core component, can implement, providing a common target.
Requirements: