Integrate Job Status and Control (JSC) interface #205
Conversation
I have reviewed the code and have no concerns. It's good to go as far as I'm concerned.
This looks fine to me! I have a couple of very minor style comments (questions, really) that should probably not hold up the merge: ...
All: thank you for the reviews! Sounds all good. I will make the changes @grondo suggested and commit them to the jsc_topic branch so that they will appear in this PR.
I think somebody called me on the "studly caps" in those reactor callback typedefs, so I have been switching to this form: `typedef int (*foo_whatever_f)(args)`.
I definitely prefer `typedef int (*foo_whatever_f)(args)`. The trailing `_f` makes it clear at a glance that the name is a function-pointer type.
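For illustration, the two naming styles under discussion might look like this (the callback name and arguments are just placeholders, not an actual flux-core typedef):

```c
/* "Studly caps" style being moved away from: */
typedef int (*FooWhatever)(int arg, void *ctx);

/* Preferred style: lowercase, with a trailing _f marking the
 * name as a function-pointer type: */
typedef int (*foo_whatever_f)(int arg, void *ctx);
```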
Apparently, there is no markdown construct that allows a table row to wrap, so some lines corresponding to table rows remain long.
Hmm, Travis is not my friend... (or it is :-). I will have to look at the failure, but this could be ... Any suggestions?
@garlick and @grondo Dockerfile support should be useful! I looked at the Travis log files and thought about this problem a bit. My current suspicion is that this is due to the producer-consumer synchronization issue I documented in ...

In a nutshell, the JSC notification service installs a job-state watcher (e.g., on lwj.5.state) when it gets notified of a new job directory creation event (e.g., lwj.5) from the KVS, and it expects the installed watcher to be invoked immediately with ENOENT to begin with. However, on a failed Travis run, this job-state watcher actually gets invoked with the "reserved" state, which then seems to mess up the expected job-state change sequence. I suspect that this is caused by a race between ...

I think I can break this race by posting the initial new-job-creation event (J_NULL) to the JSC users before installing the state watcher. This way, JSC's new-job-creation event and ... However, by doing this, there will be a larger blind window during which JSC won't be able to watch the job state changes, making the JSC notification service less precise.

I think we should revisit this issue (needing efficient producer-consumer synchronization) in the context of @grondo's task/program execution service. I also had some discussions this morning with @garlick, and perhaps efficient synchronization primitives exposed by the KVS can help with such synchronization, so I will file an issue ticket for this.

At any rate, I will run this patch through Travis several times to make sure it doesn't produce any other side effects before applying it to this PR.
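A minimal sketch of the reordering described above; the helper names and context here are hypothetical stand-ins for JSC's internals, not the actual patch:

```c
/* Hypothetical stand-ins for JSC internals; illustration only. */
typedef enum { J_NULL, J_RESERVED, J_SUBMITTED, J_RUNNING } job_state_t;

static void notify_users (const char *dir, job_state_t st)
{
    /* deliver a status-change event to registered JSC users */
}

static void watch_job_state (const char *dir)
{
    /* install the watcher on lwj.<id>.state for this job */
}

/* Invoked when a new job directory (e.g., "lwj.5") appears in the KVS. */
static void new_job_cb (const char *dir)
{
    /* Post the initial new-job-creation event (J_NULL) first... */
    notify_users (dir, J_NULL);

    /* ...and only then install the per-job state watcher.  Any state
     * change landing between these two steps falls into the "blind
     * window" mentioned above. */
    watch_job_state (dir);
}
```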
To make sure I understand:
Say the libjsc user gets swapped out while a couple of jobs run. When it wakes up, it will get a callback for each new job, but by the time it installs the watch on the job's state key, the state may already have advanced past the transitions it expected to see.

This is fundamentally racy. It feels like the KVS should be capable of giving you an interface that enables this to work without races. For example (brainstorming):
Then when ... The atomic namespace traversal idea is already captured in issue #64, which I'll mention here so this use case will be noted there.
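One way to phrase that brainstormed interface as C prototypes; these are entirely hypothetical, and per the follow-up comment and issue #206, the root-SHA1 history they would require does not currently exist:

```c
/* Hypothetical sketch of a race-free watch interface. */
typedef void (*kvs_watch_f)(const char *key, const char *val, void *arg);

/* Capture the SHA1 reference of the current KVS root (a snapshot). */
int kvs_get_rootref (void *h, char **rootref);

/* Install a watcher that first replays every change to 'key' made
 * since 'rootref', then continues with live updates, so no state
 * transition can be missed between snapshot and watch installation. */
int kvs_watch_since (void *h, const char *rootref, const char *key,
                     kvs_watch_f cb, void *arg);
```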
Erm. I think I went a little too far above. There is no record of the chain of SHA1 values taken by root, much less by arbitrary directories, so making that information available via kvs_watch() isn't currently possible. I opened issue #206. |
Re: the problem description, you are correct.

Re: the brainstorming -- after we discussed this, I also thought about it a bit. The general concept here may be a generalized watch. The main race problem seems to emerge because the current watch only supports fine-grained installation -- i.e., having to watch each individual key. Instead, if you enable a user to watch a point in the hierarchy, with the watcher invoked on any change at or below that point, ...

Now, there will be at least two problems. ... Second, when it comes down to a race, one might ask, "how do you guarantee the atomicity of the initial generalized watch installation?" This is the famous "turtles all the way down" problem: we will need some basic atomic watch installation for the initial case. Equally, we could simply rely on the producers and consumers synchronizing with each other outside the KVS subsystem to guarantee the initial atomicity.

I would like to brainstorm further on the atomic namespace traversal through #64. One thing the scheduler thrust may want to explore is revision handles that represent important resource-allocation states. This is again how one can implement resource ...

I would like to take a deeper dive into this topic after @SteVwonder starts his internship.
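A generalized watch along those lines might be prototyped like this (again hypothetical, not an existing flux-core call):

```c
/* Hypothetical: watch a point in the KVS hierarchy; the callback
 * fires on any change at or below that point.  Watching "lwj"
 * would then cover both new job directories and their state keys. */
typedef void (*kvs_subtree_watch_f)(const char *changed_key,
                                    const char *newval, void *arg);

int kvs_watch_subtree (void *h, const char *root_key,
                       kvs_subtree_watch_f cb, void *arg);
```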
OK, let's continue this in #206 and #64. +1 on getting @SteVwonder involved. That'll be good timing, I think.
OK. The patch works around the race problem in the JSC tests. I've run the Travis tests 9 times and they all succeeded. Please merge.
Integrate Job Status and Control (JSC) interface
Provide a high-level abstraction for monitoring and controlling the
status of Flux jobs. The interface is designed to expose job status
and control in a way that hides the underlying layout of job
information stored within Flux's KVS data store.
Expect that schedulers and runtime tools will be its main users. This
abstraction gives the producers of job information (e.g., the task and
program execution service modules, including wreckrun) an opportunity
to change and optimize the data layout of jobs within the KVS without
major impact on the implementation of the schedulers and runtime
tools.
The main design considerations are the following:
Build on the Job Control Block (JCB):
Our data schema containing the information needed to manage a
particular job. It contains information such as the jobid, the
resources owned by the job, and the processes spawned by the job.
JSC converts the raw information on a job into a JCB, implemented
as a JSON dictionary object. The current JCB structure is described
in tables in README.md; a rough sketch appears below.
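As a rough illustration only -- the attribute names below are assumptions loosely based on JSC's documented JCB keys, and README.md remains the authoritative reference:

```c
/* Hypothetical JCB for job 5, expressed as a JSON dictionary. */
const char *example_jcb =
    "{"
    "  \"jobid\": 5,"
    "  \"state-pair\": { \"ostate\": \"submitted\", \"nstate\": \"running\" },"
    "  \"rdesc\": { \"nnodes\": 1, \"ntasks\": 4, \"walltime\": 3600 },"
    "  \"pdesc\": { \"procsize\": 4 }"
    "}";
```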
Provide three main calls of JSC:
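The list of calls did not survive here; presumably these are JSC's notify, query, and update entry points. The prototypes below only approximate their shape and should not be read as the exact jstatctl.h declarations:

```c
#include <stdint.h>
/* flux_t is Flux's broker handle type, from the flux-core headers. */

/* Callback invoked on each job status change; 'jcb' carries the
 * updated (partial) JCB and 'errnum' any error condition. */
typedef int (*jsc_handler_f)(const char *jcb, void *arg, int errnum);

/* Register a handler for job status-change events (J_NULL, etc.). */
int jsc_notify_status (flux_t h, jsc_handler_f cb, void *arg);

/* Fetch the JCB attribute named by 'key' for the given job. */
int jsc_query_jcb (flux_t h, int64_t jobid, const char *key, char **jcb);

/* Update the JCB attribute named by 'key' for the given job. */
int jsc_update_jcb (flux_t h, int64_t jobid, const char *key, const char *jcb);
```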
Provide a standalone utility and testers:
Created a utility command to facilitate testing: flux-jstat. Its
usage is the following:
```
Usage:
    flux-jstat notify
    flux-jstat query jobid
    flux-jstat update jobid
```
flux-core/t/t2001-jsc.t also contains various test cases that use
this utility.