-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Atomic job submit and job eventing subscription and etc #1534
Comments
(thinking of the new interface) what about a RPC interface for submit where you can optionally get multiple responses, one for each state transition, until a terminal state is reached?
When you subscribe to events but delay entering the reactor, those events are not lost. They are queued in the flux_t handle. I think as long as you subscribe, then submit the job, then enter reactor and filter events based on the job id, you should pick up all the state transitions. The downside is you get a lot of events that you must ignore, but maybe OK for now? |
Or maybe a better interface for this use case would be to mimic fork/exec(2) and wait(2): one API to submit job and obtain job id, and another to block until jobid reaches terminal state then return exit code(s) or other errors. Do users need the to track the intermediate states? |
I kind of like this idea -- users or tools may also want to track a stopped state when starting a job for debugger attach, but As a counterpoint, it does seem like restricting the job state(s) you can reliably watch for is unnecessarily limiting. However, I'm not sure if you were proposing that we limit the job states for which users could watch, or just that we provide a high level wait(2)-style function that runs a callback on a subset of job states (complete, stopped)? (but still allow low-level calls to watch any job state) |
I guess no reason to limit the states that can be waited for. I should also note that, if implemented as proposed in RFC 16, status in the KVS will be a log so calling wait on a particular state that has already been passed through can be reliably handled without races. There is a list of states in RFC 16 that probably needs to be amended too. (No stopped state for example). |
I was thinking of the wait function as a regular RPC that blocks until a particular state (or one of a set of states) is reached, and then is fulfilled. Rather than implementing a new callback, it would return a future and offer a "get" function for retrieving the state and any other information associated with the state transition (like exit code or other failure). |
Yeah, I meant to say that above, we might be trying to fix a problem we'll no longer have after replacing the current system. If the low level calls to watch for job states work reliably, then we could always build higher level functionality from these low level calls. (Like the submit-and-wait call you propose could be built from the submit() API coupled with wait-for-complete) One property of
Is the KVS job schema RFC the right place to enumerate all job states? This could be a bad idea, but I was wondering if we wanted to allow job states to be extensible anyway. For example, a different implementation of a job shell might have more or less states than what we propose in an RFC. Of course there will need to be a list of required states, but allowing providers to offer new states might offer some flexibility or experimentation. |
Yes, that seems straightforward and was what I assumed. I was more asking if you were proposing this was the only way to wait on job events, and if the function was limited to only the "complete" state as in the wait system call. Returning "other information associated with state transition" seems like it might be tricky, but I can see how it might be necessary in order to avoid other races. |
micmicing fork/exec wait is a good idea. We need to carefully put good mechanisms and APIs to allow for avoiding races, though, in particular when multiple users are allowed to use wait. We need to think about whether wait should be limited to a single job or a groupd of jobs as well. I like the same or related interfaces used for single and multiple. Unix Wait also has variants like waitpid or wait for all decendent pids. |
If the future returned by the wait RPC has a Also, it might be really useful to have filtering done on the job-manager side (presuming this is the service that responds to wait RPCs), so that a wait RPC can be initiated with not only jobid filtering, but other filters like username. |
Excellent comments! Do we need an abstraction something like "process groups" to allow waiting/signaling groups of jobs? |
@dongahn had mentioned that same thought before, and my initial reaction was to use a sub-instance if you need to treat a group of jobs as an ensemble. However, after having time to think about it, there may be use cases that can't be handled this way (for example, backfilling a series of small jobs treated as a group might be more efficient than trying to run the jobs in a sub-instance) Therefore, I do think adding process groups is a great idea. |
The wreck exec system is worthless, remove it along with associated commands, tests, and support code. Since libjsc doesn't work without wreck, it is removed as well. Fixes flux-framework#1984 Closes flux-framework#1947 Closes flux-framework#1618 Closes flux-framework#1595 Closes flux-framework#1593 Closes flux-framework#1534 Closes flux-framework#1468 Closes flux-framework#1443 Closes flux-framework#1438 Closes flux-framework#1419 Closes flux-framework#1410 Closes flux-framework#1407 Closes flux-framework#1393 Closes flux-framework#915 Closes flux-framework#894 Closes flux-framework#866 Closes flux-framework#833 Closes flux-framework#774 Closes flux-framework#772 Closes flux-framework#335 Closes flux-framework#249
The wreck exec system is worthless, remove it along with associated commands, tests, and support code. Since libjsc doesn't work without wreck, it is removed as well. Fixes flux-framework#1984 Closes flux-framework#1947 Closes flux-framework#1618 Closes flux-framework#1595 Closes flux-framework#1593 Closes flux-framework#1534 Closes flux-framework#1468 Closes flux-framework#1443 Closes flux-framework#1438 Closes flux-framework#1419 Closes flux-framework#1410 Closes flux-framework#1407 Closes flux-framework#1393 Closes flux-framework#915 Closes flux-framework#894 Closes flux-framework#866 Closes flux-framework#833 Closes flux-framework#774 Closes flux-framework#772 Closes flux-framework#335 Closes flux-framework#249
The wreck exec system is worthless, remove it along with associated commands, tests, and support code. Since libjsc doesn't work without wreck, it is removed as well. Fixes flux-framework#1984 Closes flux-framework#1947 Closes flux-framework#1618 Closes flux-framework#1595 Closes flux-framework#1593 Closes flux-framework#1534 Closes flux-framework#1468 Closes flux-framework#1443 Closes flux-framework#1438 Closes flux-framework#1419 Closes flux-framework#1410 Closes flux-framework#1407 Closes flux-framework#1393 Closes flux-framework#915 Closes flux-framework#894 Closes flux-framework#866 Closes flux-framework#833 Closes flux-framework#774 Closes flux-framework#772 Closes flux-framework#335 Closes flux-framework#249
closed by #1988 |
@grondo suggested to open an issue from #1332 (comment)
As I was helping @koning, we encountered two issues (or I should just say nuances) in his use of
job.submit
rpc.He is currently using the JSC interface similar to this example code.
First thing he noticed was he wasn't able to easily pass the jobid of interest (only one job) at the time
jsc.notify_status
was called because thejobid
comes back with the response from thejob.submit
.As @grondo suggested, this can also be worked around if we used
job.create
first and thenjob.submit-nocreate.
But for simplicity, he currently stores thejobid
as a class member variable of a Python class which happens beforereactor_run
() and thus the first invocation of his job status callback.The second thing: his use case is such that the job wrapper only needs to monitor the state changes of a single job. But this JSC method invokes the job status call back for each state change of all of the jobs, which then would perturb job executions.
We could hide this by exposing richer abstraction through JSC, but I was wondering if we should actually consider the changing the topic string for a job event to include
jobid
so that one can subscribed to a state event for a specific job.These are certainly not high priority. But I'm discussion those so that we can factor them into our new execution system design.
The text was updated successfully, but these errors were encountered: