minimal flux-submit and job service #1332
Comments
Summarizing some discussion from yesterday:
These look good. I will want to hear more details of these from the perspective of scheduler integration. Just a few points:
From working with @koning to support his emerging workflow, it seems pretty important to formalize the submit RPC in addition to user-facing commands like these new ones. He said that having a rich set of APIs is one of the significant advantages of Flux compared to other RMs. Generally speaking, some users will want to use RPCs to the services directly to submit a job from their workflow tool (e.g., written in Python) and then monitor status changes of those jobs through their lifecycles. BTW, as I was helping him, I wasn't sure how we can let users subscribe to the events of a single job without creating a race condition. So we may want to build this into the job submit design such that users can submit and register a job status callback atomically.
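A minimal sketch of that atomic submit-and-watch pattern, assuming the Python bindings' handle methods (`flux.Flux`, `event_subscribe`, `rpc`, `recv`); the `job.submit` topic, its payload fields, and the `job.state` event are hypothetical placeholders for whatever the new service ends up publishing:

```python
import flux

h = flux.Flux()

# Subscribe *before* submitting, so no event published between the
# submit RPC and the subscribe call can be lost.
h.event_subscribe("job.state")  # hypothetical event topic

# Submit the signed request and learn the assigned jobid.
# "job.submit" and the payload/response fields are placeholders.
resp = h.rpc("job.submit", {"J": "<signed job request>"}).get()
jobid = resp["jobid"]

# Consume the event stream, filtering down to our job.
while True:
    msg = h.recv()
    if msg.topic != "job.state":
        continue
    event = msg.payload
    if event.get("jobid") != jobid:
        continue
    print("job {} -> {}".format(jobid, event.get("state")))
    if event.get("state") == "complete":
        break
```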
Was an issue opened for this problem? I'm probably missing something, but you can subscribe to events for a job even in the current wreck system by first using […]. I think for the replacement we plan to do even better by keeping a log of state transitions so subscribers can "catch up" at any point.
Yes, I think we'll need to work together on that design. Since we want to support a high ingest rate without inundating the entire session with broadcast events, it will likely be more efficient for sched to ask […]
Yes, I should have mentioned that above. We discussed this, and will definitely make it a goal. Probably […]
Yep, as @grondo suggested, let's get an issue open on this one.
I don't think the FLUID scheme proposed in #420 would allow the submission client to assign jobids. This type of distributed unique ID still requires a sequence number, but the number would be kept per rank (or on a series of ranks) instead of per instance, so the generator could be embedded in the job-manager service, which wouldn't have to fetch a global sequence number for each job. Allowing clients to propose their own jobids is an interesting idea, but then the job ingest service would have to verify uniqueness, which might undo any scalability gains from pushing id generation off to the client...
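For concreteness, here is a toy FLUID-style generator: a millisecond timestamp relative to generator startup, a per-rank generator id, and a per-generator sequence number packed into 64 bits. The 40/14/10-bit field split and the overflow handling are illustrative assumptions, not the actual design:

```python
import time

class FluidGenerator:
    def __init__(self, generator_id):
        assert 0 <= generator_id < (1 << 14)
        self.generator_id = generator_id
        self.epoch = int(time.time() * 1000)  # generator startup, ms
        self.last_ts = -1
        self.seq = 0

    def _now(self):
        return int(time.time() * 1000) - self.epoch

    def generate(self):
        ts = self._now()
        if ts == self.last_ts:
            self.seq += 1
            if self.seq >= (1 << 10):      # sequence exhausted this ms:
                while ts == self.last_ts:  # wait for the clock to advance
                    ts = self._now()
                self.seq = 0
        else:
            self.seq = 0
        self.last_ts = ts
        # Pack: 40-bit timestamp | 14-bit generator id | 10-bit sequence
        return (ts << 24) | (self.generator_id << 10) | self.seq

# Each ingest rank embeds its own generator, so no global sequence
# number needs to be fetched per job.
gen = FluidGenerator(generator_id=3)
print(gen.generate())
```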
Excellent point!
One of the design points mentioned above was discussed further offline (@grondo and me). As reported in #1543, we thought it would be better to keep the original idea of a separate ingest module and manager module: […]
In addition, I proposed that the job-ingest module would […]
The event would be the mechanism by which distributed ingest modules (using FLUID jobids) notify manager module(s) that new jobs have been ingested. The manager module would in turn interface with the scheduler and user tools. As I recall, we brainstormed a bit on the manager module and how it would interact with tools and the scheduler. One idea from @grondo was that it might eventually support SQL queries on jobs, including completed jobs, similar to the sqlog add-on to SLURM. Pondering this further, it may be that job listing and even the scheduler interface could usefully be built on SQL queries.
I was thinking something similar, but wondered if trying to define a rigid schema for job data might reduce our flexibility. I wonder if there is a document database we could leverage as simply as SQLite, one we could use to stand up something very quickly but grow as the needs of the scheduler and job query tools evolve?
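One middle ground worth noting: SQLite itself can act as a crude document store by keeping a few indexed columns for hot queries and stashing the rest of the job record as JSON. A sketch, with the table layout and field names invented for illustration (requires SQLite built with the JSON1 extension):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE jobs ("
    " id INTEGER PRIMARY KEY,"  # the jobid
    " state TEXT,"              # indexed lifecycle state for hot queries
    " record TEXT)"             # everything else, as a JSON document
)
db.execute("CREATE INDEX jobs_state ON jobs (state)")

record = {"userid": 1001, "jobspec": {"resources": {"cores": 4}}}
db.execute(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    (42, "pending", json.dumps(record)),
)

# Rigid-column filter plus document-style extraction in one query:
for row in db.execute(
    "SELECT id, json_extract(record, '$.jobspec.resources.cores')"
    " FROM jobs WHERE state = 'pending'"
):
    print(row)  # (42, 4)
```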
For the most part this issue is resolved, although there is some good discussion here.
Implement a minimal job service that accepts signed job requests and "enqueues" them in the KVS per the job scheme described in RFC 16.
Add a flux-submit command that converts user command line arguments into J, as described in RFC 15.
Use the YAML jobspec described in RFC 14.
Finally, add a flux-joblist or similar command for listing jobs.
The complete resource model need not be implemented at first. Simply supporting cores would be a start.
Implement a notification request where the scheduler or another service can send in a sequence number of "last job request received" and obtain a block of job requests that have arrived since; the request should block when there are no new requests available (sketched below).
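A sketch of that notification protocol in plain Python, with a monotonically increasing sequence number over an append-only log of job requests (names and structure invented for illustration, not the flux-core implementation):

```python
import threading

class JobRequestLog:
    """Append-only log; sequence number == count of entries so far."""

    def __init__(self):
        self.cond = threading.Condition()
        self.entries = []

    def append(self, job_request):
        with self.cond:
            self.entries.append(job_request)
            self.cond.notify_all()  # wake any parked notification requests

    def fetch_since(self, last_seq):
        """Return (new_seq, all requests after last_seq), blocking
        until at least one new request has arrived."""
        with self.cond:
            while len(self.entries) <= last_seq:
                self.cond.wait()
            return len(self.entries), self.entries[last_seq:]

# Scheduler side: repeatedly ask for everything after the last
# sequence number it has seen.
log = JobRequestLog()
log.append({"jobid": 1, "J": "<signed request>"})
seq, batch = log.fetch_since(0)
print(seq, batch)  # 1 [{'jobid': 1, 'J': '<signed request>'}]
```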