Job submission API #268

Closed
trws opened this issue Jul 13, 2015 · 8 comments

@trws
Member

trws commented Jul 13, 2015

While this can be prototyped with the 'flux-submit' tool, a practical version will require a way to submit jobs directly to the system. If we need to design this anyway, it seems worth designing an interface for job submission, something that different schedulers, or perhaps even a core component, can implement and provide as a common target.

Requirements:

  • An expected response, probably an RPC or request to the scheduler
  • A mechanism to create, and populate, a job request
trws added this to the Flux in-job scheduling tool milestone on Jul 13, 2015
@dongahn
Member

dongahn commented Jul 14, 2015

Just to further the discussion a bit:

It looks like the job service module within flux-core invokes a callback upon receiving a "job.create" request, creates a new job id, sets the state of this job to "reserved," and then responds to the requester with the new job id.

Upon receiving this response, flux-submit will set the state to "submitted," and this state transition (reserved->submitted) should trigger the scheduling action.

Now, if the job module itself can write the "submitted" state while processing a successful job-creation request, before responding to the requester, this should essentially wake up both flux-sched's scheduler framework service and the requester at the same time. -- We'll need some error handling, of course...

I haven't tested this, but it should allow you to bypass flux-submit and create your own submission tool... Is this along the lines of what you are thinking?
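
Roughly, that sequence would look like the following (the topic name, payload fields, and KVS key here are illustrative only, not settled interfaces):

topic: "job.create"
payload: { "nnodes": N, "ntasks": M }   # resource request piggybacked on create

job.create reply:
payload: { "id": 42 }                   # new job id; job state is now "reserved"

# the submitter (or the job module itself) then writes, e.g.:
#   lwj.42.state = "submitted"
# and the reserved->submitted transition wakes the scheduler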

I am not sure whether this is enough to get the throughput you want, though. You may have discussed this a bit at today's meeting, but perhaps one needs a way to batch a large number of job.create requests and submit them to the job module, which then emits a 'bulk submit' event... and also responds to the request with the job id list (e.g., 4-45, 47-409). Perhaps the current job module could be modified to support such bulk job requests as well.
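
As a sketch of what such a bulk variant could look like (the topic name and every field here are purely hypothetical):

topic: "job.create-bulk"
payload: {
  "njobs"   : 400,                 # number of copies of the request to create
  "request" : { "nnodes": 1 }      # per-job request object, as in job.create
}

job.create-bulk reply:
payload: {
  "ids" : "4-45,47-409"            # id ranges of the created jobs
}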

On the second requirement, I think the job module currently expects a JSON object with the number of nodes and processes, piggybacked on a job.create request. Would you like to deal with resource shapes richer than the number of nodes and processes now? And/or would you like a more fine-grained job request creation flow beyond this single job.create request?

Either sounds reasonable, particularly with actual use cases in mind. But I would be careful about the extensibility of the resource specification design, as our job schema has not yet been defined.
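
To make the extensibility concern concrete, compare a flat request object with a more structured one (both shapes are purely hypothetical):

payload: { "nnodes": 2, "ntasks": 16 }    # flat: simple now, hard to extend

payload: {                                 # structured: room for richer shapes
  "resources" : [
    { "type": "node", "count": 2,
      "with": [ { "type": "core", "count": 8 } ] }
  ]
}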

@trws
Member Author

trws commented Jul 27, 2015

After today's discussion, it seems like what we need here is a protocol for job submission, ideally defining a base protocol and then allowing extensions such as those described in #293. As a first cut, we can nail down the base protocol, then work on extensions of it later. This also ties into the need for protocol/interface definitions for querying resources and interacting with resource instances across the project. I'll be adding tickets for those others shortly.

@lipari
Contributor

lipari commented Jul 28, 2015

The big picture can be broken down into five pieces.
The first is to describe a set of resources that will be needed by a job.
The second is to discover whether those resources exist or are "feasible" (#269) .
The third is to discover whether those resources are available to run a job (i.e., not reserved or running other jobs).
The fourth is to discover whether the user is permitted by policy to run a job on those resources.
The fifth is to issue the request to actually launch a job on the requested set of resources.

How these pieces come into play depends on the meaning of "submit" and the intended outcome.

I know we're talking semantics here, but in my mind, "submit" implies a queue. When you submit a job to a system, whether it be for immediate execution or to a batch queue, the assumption is that some processing needs to occur before your job is allowed to run. This contrasts with "run" which implies a command for immediate execution on specified resources.

If the outcome is to instantiate a Flux program on the requested resources and run immediately - i.e., "launch this program now" - the assumption for this action is that the resources are available and allowed to run a job for an indefinite period of time.

The other outcome is for anything other than "launch this program now" and falls to a scheduling service that, by design, we have relegated to a service outside of flux-core. At that point, this issue will indeed apply to the flux-sched side and serve to design/enhance "flux submit".

All that said, I believe this issue is about just the first and fifth pieces described at the top of this message; it does not include the other pieces, which would qualify or schedule the request.

One other point... other schedulers have a notion of a "job array". This covers the case where a number of jobs are submitted and run in bulk fashion. Typically a job quantity is specified, an executable is named, and there is a means to specify the input and output files for each job. If we were to add a bulk capability to our job submission protocol, I would suggest using the term "array" instead of "bulk".
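
For illustration only (every field name here is invented), an array-style request might carry something like:

payload: {
  "cmdline" : [ "prog" ],
  "array"   : "1-100",        # job quantity as an index range
  "input"   : "in.%a",        # "%a" expands to the array index
  "output"  : "out.%a"
}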

@trws
Member Author

trws commented Jul 28, 2015

It also requires the second, to use your numbering scheme. What I'm looking for is a way to generically submit a job to whatever mechanism is available to service that request. At the bare minimum, that may be a blocking queue of length one that does a feasibility check followed by an availability check: if the feasibility check fails, the submission fails; if it passes but the availability check fails, the submission either blocks or fails. When the availability check passes, the resources are allocated and the job is run. This is distinct from a "run" service that runs a given set of tasks on a specific set of pre-determined resources.
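
A minimal sketch of that behavior in Python (the node-count resource model and every name here are illustrative stand-ins, not Flux interfaces):

import threading

TOTAL_NODES = 4                        # assumed size of the instance
free_nodes = TOTAL_NODES
cond = threading.Condition()

def submit(nnodes, run_fn):
    global free_nodes
    if nnodes > TOTAL_NODES:           # feasibility check: can never be satisfied
        raise ValueError("infeasible resource request")
    with cond:
        while free_nodes < nnodes:     # feasible but unavailable: block
            cond.wait()                # woken when resources are released
        free_nodes -= nnodes           # availability check passed: allocate
    try:
        return run_fn()                # run the job on the allocation
    finally:
        with cond:
            free_nodes += nnodes       # release and wake blocked submitters
            cond.notify_all()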

@lipari
Contributor

lipari commented Jul 28, 2015

Ok, good, thanks for clarifying. It seems to me that what you're describing is a simple FIFO scheduler.

@grondo
Contributor

grondo commented Jul 28, 2015

What if we focus this Issue on creating a protocol spec (rfc-style) for the job submission API @trws requires, then decide where to put the result and the utilities that will use it after? I don't feel like I have a handle on what the protocol might look like, but it feels like it would be pretty simple for this base case.

Maybe we can start with the simplest RPC definition, and refine from there?

I'll have to read through the JSON Content Rules RFC to get the form correct, but perhaps an off-the-cuff example would get us started?

topic: "job.request"
payload: {
  "cmdline"   : [ "program", "args", ],
  "cwd"       : "path",
  "environ"   : {  "var": "value", }, # (overrides or complete)
  "resources" : { request-obj },      # resource request object to be defined,
                                      #  e.g. { "ntasks" : N }
  "options"   : { option: "value" },  # extensible options dictionary    
}

job.request reply:
payload: {
  "id"     : "ID",      # On success, a new non-zero id is returned
  "errmsg" : "string"   # On failure, an error message is returned
}

That is pretty basic, but I suppose we have to start somewhere.
On successful return, the caller should be able to expect that their job entry in the KVS has been populated with the basic parameters from their request, plus anything added by additional job submission filters, etc. Maybe the reply should include a path in the KVS?

@trws
Member Author

trws commented Jul 28, 2015

That looks like a good start to me; add in a KVS path as you suggest, plus a KVS root version, and I think we'd have the base of it. I'm thinking it would also be good to have a little API library for this, just basic create and submit functions, that kind of thing, so a user can run a simple job with defaults in just a few calls, but can arbitrarily manipulate the job between the two calls if they want.
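
A sketch of that two-call library in Python, assuming the job.request protocol above (the function names and the rpc() helper are hypothetical, not existing flux interfaces):

def job_create(cmdline, **overrides):
    # Build a job.request payload pre-filled with defaults.
    req = {
        "cmdline"   : cmdline,
        "cwd"       : ".",
        "environ"   : {},
        "resources" : { "ntasks": 1 },
        "options"   : {},
    }
    req.update(overrides)
    return req

def job_submit(handle, req):
    # Send the request; return the new id and KVS path on success.
    reply = handle.rpc("job.request", req)   # hypothetical RPC helper
    if "errmsg" in reply:
        raise RuntimeError(reply["errmsg"])
    return reply["id"], reply["kvs_path"]    # "kvs_path" per the suggestion above

Given a handle h, a simple job with defaults is then two calls, with arbitrary manipulation possible in between:

req = job_create([ "hostname" ])
req["resources"]["ntasks"] = 32
jobid, path = job_submit(h, req)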

@trws
Member Author

trws commented Jul 28, 2015

Ah, and we should probably add a job.features request to retrieve a listing of the additional options that the handler for this interface supports.

topic: "job.features"
payload: N/A

job.features reply:
payload: {
"option" : "value",...
}
