
minimal flux-submit and job service #1332

Closed
garlick opened this issue Feb 12, 2018 · 9 comments

garlick commented Feb 12, 2018

Implement a minimal job service that accepts signed job requests and "enqueues" them in the KVS per the job scheme described in RFC 16.

Add a flux-submit command that converts the user's command line arguments into J, as described in RFC 15.

Use the YAML jobspec described in RFC 14.

Finally, add a flux-joblist or similar command for listing jobs.

The complete resource model need not be implemented at first. Simply supporting cores would be a start.
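For reference, a minimal cores-only jobspec in the spirit of RFC 14 could be built and rendered along the lines of the sketch below; the field names and nesting are only an approximation of the RFC layout, and the RFC remains authoritative.

```python
# Sketch of a minimal, cores-only jobspec in the spirit of RFC 14.
# Field names and nesting are illustrative approximations of the RFC layout.
import yaml  # PyYAML, assumed to be available


def minimal_jobspec(command, ncores=1):
    """Build a cores-only jobspec dict and render it as YAML."""
    jobspec = {
        "version": 1,
        "resources": [
            {
                "type": "slot",
                "count": 1,
                "label": "task",
                "with": [{"type": "core", "count": ncores}],
            }
        ],
        "tasks": [
            {"command": command, "slot": "task", "count": {"per_slot": 1}}
        ],
        "attributes": {"system": {"duration": 0}},
    }
    return yaml.safe_dump(jobspec)


print(minimal_jobspec(["hostname"], ncores=2))
```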

Implement a notification request where the scheduler or another service can send in a sequence number for the "last job request received" and obtain a block of job requests that have arrived since. The request blocks when there are no new requests available.
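For illustration, the consumer side of such a request might look roughly like this sketch. Only the Python bindings' generic rpc() call is assumed; the "job-service.fetch" topic and its payload/response fields are invented for the example.

```python
# Sketch of the "fetch new job requests since seqnum" pattern described above.
# The topic "job-service.fetch" and the payload/response fields are
# hypothetical; only the generic flux handle rpc() interface is assumed.
import flux

h = flux.Flux()
last_seq = 0
while True:
    # Blocks in the service until at least one request newer than last_seq exists.
    resp = h.rpc("job-service.fetch", {"since": last_seq}).get()
    for req in resp["jobs"]:          # hypothetical response field
        last_seq = max(last_seq, req["seq"])
        print("new job request:", req["id"])
```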


garlick commented May 22, 2018

Summarizing some discussion from yesterday:

  • Instead of reworking flux submit, which is in use with wreck, let's add a flux job command with subcommands for submit, list, etc.
  • Although RFC 16 separately calls out an ingest agent and job management service, they can be combined in a single job-manager service.
  • job-manager should be distributed (e.g. loaded on all ranks) for scalability.
  • job-manager should be restartable, e.g. it should recover the job queue state from the KVS.
  • The KVS layout for active and inactive jobs is described in RFC 16. The set of active jobs is essentially "the queue".
  • there should be a scheme for registering validators with the job-manager, e.g. external entities which can check job requests for conformance to site policy, etc. (a toy sketch follows this list).
  • validation failure should be reflected in a submit error, not later when the job is scheduled.
  • the submission process should assign a jobid to the job; alternatively, a more scalable decentralized scheme like the one discussed in #470 (RFC: Replace monotonic job sequence numbers with distributed unique id service) could assign jobids on the submission client end.
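To make the validator idea above a little more concrete, a site-policy check could be as simple as a function applied to the decoded jobspec before acceptance. This toy sketch says nothing about how validators would actually be registered with job-manager; the core limit is an invented policy.

```python
# Toy illustration of a site-policy validator applied to a decoded jobspec.
# How validators are registered with the job manager is not specified here;
# this only shows the kind of check a validator might perform.
MAX_CORES = 1024  # hypothetical site policy


def count_cores(resources):
    """Recursively total up 'core' resources in an RFC 14-style resource list."""
    total = 0
    for r in resources:
        if r.get("type") == "core":
            total += r.get("count", 1)
        total += count_cores(r.get("with", [])) * r.get("count", 1)
    return total


def validate(jobspec):
    """Return None if the jobspec passes site policy, else an error string."""
    if jobspec.get("version") != 1:
        return "unsupported jobspec version"
    if count_cores(jobspec.get("resources", [])) > MAX_CORES:
        return "job exceeds site core limit"
    return None
```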


dongahn commented May 22, 2018

These look good. I will want to hear more details of these from the perspective of scheduler integration.

Just a few points:

Instead of reworking flux submit which is in use with wreck, let's add a flux job command with subcommands for submit, list, etc.

From working with @koning to support his emerging workflow, it seems pretty important to formalize the submit RPC in addition to user-facing commands like these new flux job commands.

He said being able to use a rich set of APIs is one of the significant advantages of Flux compared to other RMs.

Generally speaking, some users will want to use RPCs to the services directly to submit a job from their workflow tool (e.g., written in Python) and then monitor status changes of those jobs throughout their lifecycles.
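For example, that kind of direct-RPC submission from a Python tool might look roughly like the sketch below. Only the generic flux.Flux() handle and rpc() call are assumed to exist; the "job.submit" topic and the payload/response fields are placeholders.

```python
# Rough sketch of submitting a job by RPC from a Python workflow tool.
# The topic "job.submit" and the payload/response fields are placeholders;
# only the generic flux.Flux() handle and rpc() method are assumed to exist.
import flux


def submit(h, signed_jobspec):
    """Submit a signed jobspec (J) and return the jobid assigned by the service."""
    resp = h.rpc("job.submit", {"J": signed_jobspec}).get()
    return resp["jobid"]  # hypothetical response field


h = flux.Flux()
jobid = submit(h, signed_jobspec="<J produced per RFC 15>")
print("submitted job", jobid)
```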

BTW, as I was helping him, I wasn't sure how we can allow a user to subscribe to the events of a single job without creating a race condition. So we may want to fold this into the job submit design, such that users can submit and register a job status callback atomically.


grondo commented May 22, 2018

BTW, as I was helping him, I wasn't sure how we can allow a user to subscribe to the events of a single job without creating a race condition. So we may want to fold this into the job submit design, such that users can submit and register a job status callback atomically.

Was an issue opened for this problem? I'm probably missing something, but you can subscribe to events for a job even in the current wreck system by first using job.create and then job.submit-nocreate. This is how flux-wreckrun works when sched is loaded, so I'm surprised it didn't work for you.

I think for the replacement we plan to do even better by keeping a log of state transitions so subscribers can "catch up" at any point.


garlick commented May 22, 2018

I will want to hear more details of these from the perspective of scheduler integration

Yes, I think we'll need to work together on that design. Since we want to support a high ingest rate without inundating the entire session with broadcast events, it will likely be more efficient for sched to ask job-manager to track queue growth for it. In other words, provide an RPC that sched would use to fetch batches of new active jobs as they are created, rather than an event per job creation. The RPC would block when no jobs are available, and sched would use reactive programming to manage the async response.
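In reactor terms, the scheduler side of that blocking fetch might be structured like this sketch. The "job-manager.fetch" topic and its fields are invented; the then()/reactor_run() calls are the generic future/reactor helpers from the Python bindings.

```python
# Sketch of the reactive (callback) variant of the batch-fetch RPC described
# above.  The "job-manager.fetch" topic and its fields are invented; only the
# generic rpc()/then()/reactor_run() helpers from the Python bindings are used.
import flux

h = flux.Flux()
last_seq = 0


def fetch():
    # Blocks in the service until new active jobs exist, then calls on_jobs().
    h.rpc("job-manager.fetch", {"since": last_seq}).then(on_jobs)


def on_jobs(future):
    global last_seq
    resp = future.get()
    for job in resp["jobs"]:              # hypothetical response field
        last_seq = max(last_seq, job["seq"])
        print("new active job:", job["id"])
    fetch()                               # immediately ask for the next batch


fetch()
h.reactor_run()                           # hand control to the reactor
```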

it seems pretty important to formalize the submit RPC

Yes, I should have mentioned that above. We discussed this, and will definitely make it a goal. Probably flux job submit will just be a user of this API.

I wasn't sure how we can allow user to subscribe to the events of a single job without creating a race condition

Yep, as @grondo suggested, let's get an issue opened on this one.


grondo commented May 22, 2018

the submission process should assign a jobid to the job; alternatively a more scalable decentralized scheme like the one discussed in #470 could assign jobids on the submission client end

I don't think the FLUID scheme proposed in #470 would allow the submission client to assign jobids. This type of distributed unique ID still requires a sequence number, but the number would be kept per rank (or on a series of ranks) instead of per instance, so the generator could be embedded in the job-manager service, which wouldn't have to fetch a global sequence number for each job.
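For background, a FLUID/Snowflake-style ID packs a timestamp, a generator (e.g. rank) id, and a per-generator sequence number into a single integer, so uniqueness only requires that each generator own its own counter. The bit widths in this sketch are arbitrary choices for illustration, not those of the actual FLUID proposal.

```python
# Illustrative FLUID/Snowflake-style distributed ID generator.
# Bit widths are arbitrary choices for this sketch, not the FLUID spec:
#   40 bits of milliseconds since an epoch, 14 bits of generator (rank) id,
#   10 bits of per-generator sequence number.
import time


class IdGenerator:
    def __init__(self, generator_id, epoch_ms=0):
        assert 0 <= generator_id < (1 << 14)
        self.generator_id = generator_id
        self.epoch_ms = epoch_ms
        self.last_ms = -1
        self.seq = 0

    def next(self):
        now = int(time.time() * 1000) - self.epoch_ms
        if now == self.last_ms:
            self.seq = (self.seq + 1) % (1 << 10)
            if self.seq == 0:            # sequence exhausted in this ms; wait
                while now <= self.last_ms:
                    now = int(time.time() * 1000) - self.epoch_ms
        else:
            self.seq = 0
        self.last_ms = now
        return (now << 24) | (self.generator_id << 10) | self.seq


gen = IdGenerator(generator_id=3)
print(hex(gen.next()))
```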

Allowing the client to propose its own jobids is an interesting idea, but then the job ingest service would have to verify uniqueness, which might undo any scalability gains from pushing the id generation off to the client...


garlick commented May 22, 2018

but then the job ingest service would have to verify uniqueness

Excellent point!


garlick commented Jun 14, 2018

One of the design points mentioned above

  • Although RFC 16 separately calls out an ingest agent and job management service, they can be combined in a single job-manager service

was discussed further offline (@grondo and me). As reported in #1543, we thought it would be better to keep the original idea of a separate ingest module and manager module:

  • Rename job-manager to job-ingest per discussion with @grondo. We'll keep ingest separate from the management functionality, so that we can load job-ingest across an instance and possibly load job manager (which may have a larger memory footprint) on a subset of nodes.

In addition, I proposed that the job-ingest module would:

  • Issue an event containing a batch of new jobids when batch KVS commits complete.

The event would be the mechanism by which distributed ingest modules (using FLUID jobids) would notify manager module(s) that new jobs have been ingested. The manager module would in turn interface with the scheduler and user tools.
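A sketch of what the manager side of that notification might look like, assuming a hypothetical "job-ingest.batch" event topic and an "ids" payload field; only the Python bindings' generic event subscribe/receive calls are assumed.

```python
# Sketch of a manager-side consumer of a hypothetical "job-ingest.batch" event
# announcing newly committed jobids.  The event topic and the "ids" payload
# field are invented; event_subscribe()/event_recv() are assumed from the
# Python bindings.
import flux

h = flux.Flux()
h.event_subscribe("job-ingest.batch")
while True:
    msg = h.event_recv()
    for jobid in msg.payload.get("ids", []):   # hypothetical payload field
        print("ingested job", jobid)
```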

As I recall we brainstormed a bit on the manager module and how it would interact with tools and the scheduler. One idea from @grondo was that it might eventually support SQL queries on jobs, including completed jobs, similar to the sqlog add-on to SLURM. Pondering this further, it may be that job listing and even the scheduler interface could be usefully built on SQL queries.


grondo commented Jun 14, 2018

Pondering this further, it may be that job listing and even the scheduler interface could be usefully built on SQL queries.

I was thinking something similar, but wondered if trying to define a rigid schema for job data might reduce our flexibility. I wonder if there is a document database we could leverage as simply as SQLite, one we could use to stand something up very quickly but grow as the needs of the scheduler and job query tools evolve?
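One possible middle ground is SQLite with a few fixed, indexed columns plus a free-form JSON column for everything else, which keeps simple queries cheap without committing to a rigid schema up front. A rough sketch:

```python
# Rough sketch of a job index in SQLite: a few indexed core columns plus a
# free-form JSON blob, so the schema can evolve without migrations.
import json
import sqlite3

db = sqlite3.connect("jobs.db")
db.execute(
    """CREATE TABLE IF NOT EXISTS jobs (
           id        INTEGER PRIMARY KEY,
           state     TEXT,
           t_submit  REAL,
           t_done    REAL,
           data      TEXT        -- JSON: jobspec, user, resources, etc.
       )"""
)


def record(jobid, state, t_submit, extra):
    db.execute(
        "INSERT OR REPLACE INTO jobs (id, state, t_submit, data) VALUES (?,?,?,?)",
        (jobid, state, t_submit, json.dumps(extra)),
    )
    db.commit()


record(42, "submitted", 1528934400.0, {"user": "alice", "ncores": 2})
for row in db.execute("SELECT id, state FROM jobs WHERE state = 'submitted'"):
    print(row)
```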


garlick commented Jul 17, 2019

For the most part this issue is resolved, although there is some good discussion here.

garlick closed this as completed Jul 17, 2019