
Integrate Job Status and Control (JSC) interface #205

Merged
merged 8 commits into from
May 29, 2015

Conversation


dongahn commented May 26, 2015

Provide a high-level abstraction for monitoring and controlling the
status of Flux jobs. The interface is designed to expose job status and
control while hiding the underlying layout of job information stored
within Flux's KVS data store. Schedulers and runtime tools are expected
to be its main users. This abstraction gives the producers of job
information, including task and program execution service modules such
as wreckrun, an opportunity to change and optimize the data layout of
jobs within the KVS without major impact on the implementation of the
schedulers and runtime tools.

The main design considerations are the following:
  • Abstract out implementation-dependent job states;
  • Provide flexible and easily extensible mechanisms to pass job info;
  • Use a minimalistic API set.
Build on the Job Control Block (JCB):

Our data schema containing the information needed to manage a
particular job, such as the jobid, the resources owned by the job, and
the processes spawned by the job. JSC converts the raw information on a
job into a JCB, implemented as a JSON dictionary object. The current
JCB structure is described in tables in README.md.

Provide three main JSC calls (a usage sketch follows this list):
  • jsc_notify_status (): register a callback with the asynchronous status-change notification service; the callback will be invoked whenever the state of a job changes;
  • jsc_query_jcb (): query a key attribute of the JCB of jobid;
  • jsc_update_jcb (): update a key attribute within the JCB of jobid.
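
For reference, a minimal sketch of how a consumer might register with the
notification service; the handle argument, header names, and the
flux_open()/flux_close() connection calls are assumptions here, not taken
from this PR:

    #include <stdio.h>
    #include <json/json.h>      /* json-c; include path may differ */
    #include <flux/core.h>
    #include "jstatctl.h"       /* assumed JSC header name */

    /* Invoked whenever a job's state changes; base_jcb is a JCB fragment
     * (a JSON dictionary) carrying at least the jobid and the new state. */
    static int job_status_cb (json_object *base_jcb, void *arg, int errnum)
    {
        if (errnum == 0)
            printf ("JCB update: %s\n",
                    json_object_to_json_string (base_jcb));
        return 0;
    }

    int main (void)
    {
        flux_t h = flux_open (NULL, 0);     /* connect to the broker */
        if (!h)
            return 1;
        if (jsc_notify_status (h, job_status_cb, NULL) < 0)
            return 1;
        /* ... enter the flux reactor loop here so callbacks are delivered;
         *     jsc_query_jcb ()/jsc_update_jcb () can be used for one-shot
         *     reads/writes of a JCB attribute ... */
        flux_close (h);
        return 0;
    }
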
Provide a standalone utility and testers:

Created a utility command to facilitate testing: flux-jstat. Its
usage is the following:

Usage:
flux-jstat notify
flux-jstat query jobid
flux-jstat update jobid

flux-core/t/t2001-jsc.t also contains various test cases that use
this utility.

@garlick garlick added the review label May 26, 2015
@dongahn dongahn closed this May 26, 2015
@garlick garlick removed the review label May 26, 2015
@dongahn dongahn reopened this May 26, 2015
@garlick garlick added the review label May 26, 2015

garlick commented May 27, 2015

I'd like to merge this so @dongahn can get on with his next steps, and so he will not have to track the flux API as we change it. Any input from @lipari or @grondo, since this will be a bridge between the scheduler and wreckrun (or its replacement)?


lipari commented May 27, 2015

I have reviewed the code and have no concerns. It's good to go as far as I'm concerned.


grondo commented May 27, 2015

This looks fine to me!

I have a couple very minor style comments (questions, really), that should probably not hold up the merge:

  • In the README.md there are some very long lines. For consistency, I would prefer these be wrapped at column 78 or so. This is not only to ease editing in a console, but also will improve line-based diff output of future changes. (I know, I do see some tables that you won't be able to do anything about).

  • The typedef function pointer JSC_CB_PTR doesn't follow our unwritten naming convention for function types. I'm not sure how strongly I feel about this, but my preference would be to only capitalize CPP macros to avoid confusion. Also maybe follow the convention used in the Flux reactor and make the type a pointer to the callback function as seen with

    typedef int (*FluxMsgHandler)(flux_t h, int typemask, zmsg_t **zmsg, void *arg);

    e.g. for JSC_CB_PTR you would perhaps have

    typedef int (*JSCHandler)(json_object *base_jcb, void *arg, int errnum);

    It would help users of the API I think to have some common convention for the callback typedefs. (It doesn't have to be this convention here, maybe that is a separate discussion)


dongahn commented May 27, 2015

All: thank you for the reviews! Sounds all good. I will make the changes @grondo suggested and commit them to the jsc_topic branch so that they will appear in this PR.


garlick commented May 27, 2015

I think somebody called me on the "studly caps" in those reactor callback typedefs, so I have been switching to this form: typedef int (*foo_whatever_f)(args). I guess we should pick a style and update RFC 7.


grondo commented May 27, 2015

I definitely prefer

typedef int (*foo_whatever_f)(args)

The trailing _f calls out that this is a function pointer type.
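
Applied to the JSC callback discussed earlier, that convention might read
as follows (the name below is only an illustration, not a decided-on API):

    typedef int (*jsc_handler_f)(json_object *base_jcb, void *arg, int errnum);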

dongahn added 2 commits May 27, 2015 19:25
Apparently, there is no markdown construct that allows table rows to
wrap, so some lines corresponding to table rows still remain long.

dongahn commented May 28, 2015

Hmm, Travis is not my friend... (or it is :-) I will have to look at the failure, but it could be that run_timeout 2 is not long enough for wreckrun to run 4-node jobs. A big question is how to reproduce the same issue in our local development environment...

Any suggestions?


garlick commented May 28, 2015

Not sure if it will be of help here, but @grondo provided a Dockerfile in PR #191 that mimics the Travis environment.


dongahn commented May 28, 2015

@garlick and @grondo: Dockerfile support should be useful!

I looked at the Travis log files and thought about this problem a bit.
My current suspicion is that this is due to the producer-consumer
synchronization issue I documented in README.md:

JCB producer-consumer synchronization -- currently there is no
built-in synchronization between JCB producers and consumers, and thus a
race condition can occur. When the remote parallel execution changes
the state of a job, the registered callbacks will be invoked.
However, when one of the invoked callbacks is trying to read a JCB
attribute, nothing prevents the remote execution from modifying the
same JCB attribute! Because producers and consumers use the KVS like a
distributed shared memory, one must devise ways to guarantee
synchronization. One solution is for the producers to also use the JSC
API and for us to build some synchronization primitives into this API.
But for now, we ignore these synchronization issues.

In a nutshell, the JSC notification service installs a job-state watcher (e.g., on lwj.5.state) when it gets notified of a new job directory creation event (e.g., lwj.5) from the KVS, and it expects the installed watcher to be invoked immediately with ENOENT to begin with.

However, on a failed Travis run, this job-state watcher actually gets invoked with the "reserved" state, which then seems to mess up the expected job-state change sequence.

I suspect this is caused by a race between wreckrun storing a new state ("reserved") and JSC installing the state watcher. Depending on who wins the race, the behavior changes.

I think I can break this race by posting the initial new-job-creation event (J_NULL) to the JSC users before installing the state watcher. This way, JSC's new-job-creation event and wreckrun's subsequent state-change events will form a clear happens-before relationship.
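
Roughly, the fix amounts to swapping two steps in JSC's handling of a new
job; the helper names below are hypothetical stand-ins for illustration,
not the actual patch:

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical stand-ins for JSC internals, for illustration only. */
    static void post_event_to_users (int64_t jobid, const char *state)
    {
        printf ("job %lld: %s\n", (long long)jobid, state);
    }
    static void install_state_watcher (int64_t jobid)
    {
        printf ("watching lwj.%lld.state\n", (long long)jobid);
    }

    /* The fix: post the new-job-creation event (J_NULL) to JSC users
     * before installing the state watcher, so every state change the
     * watcher later reports is ordered after the J_NULL event. */
    static void on_new_job (int64_t jobid)
    {
        post_event_to_users (jobid, "J_NULL");
        install_state_watcher (jobid);
    }

    int main (void)
    {
        on_new_job (5);
        return 0;
    }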

However, by doing this, there will be a larger blind window during which JSC won't be able to watch the job state changes, making the JSC notification service imprecise.

I think we should revisit this issue (the need for efficient producer-consumer synchronization) in the context of @grondo's task/program execution service.

I also had some discussions this morning with @garlick, and perhaps efficient synchronization primitives exposed by the KVS can help with such synchronization. So I will file an issue ticket for this.

At any rate, I will run this patch through Travis several times to make sure this doesn't produce any other side effects before applying it to this PR.


garlick commented May 28, 2015

To make sure I understand:

  1. libjsc calls kvs_watch on lwj.next-id. When wreckrun launches a job, it increments the value; libjsc gets a callback and reads the new id.
  2. the callback then kvs_watches lwj.N.state and gets a callback on each state transition for the job.
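
A sketch of that two-level pattern, assuming the 2015-era typed
kvs_watch_int64()/kvs_watch_string() variants and their callback
signatures (those details are assumptions here; error handling omitted):

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>
    #include <flux/core.h>
    #include "kvs.h"   /* assumed header for the libkvs watch API */

    /* Level 2: fires on each transition of lwj.N.state for one job. */
    static int job_state_cb (const char *key, const char *state,
                             void *arg, int errnum)
    {
        if (errnum == 0)
            printf ("%s = %s\n", key, state);
        return 0;
    }

    /* Level 1: fires when wreckrun bumps lwj.next-id; install a watch
     * on the state of the job directory that was just created. */
    static int next_id_cb (const char *key, int64_t next_id,
                           void *arg, int errnum)
    {
        flux_t h = arg;
        char state_key[64];

        if (errnum != 0)
            return 0;
        snprintf (state_key, sizeof (state_key),
                  "lwj.%" PRId64 ".state", next_id - 1);
        return kvs_watch_string (h, state_key, job_state_cb, h);
    }

    /* At startup: kvs_watch_int64 (h, "lwj.next-id", next_id_cb, h); */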

Say the libjsc user gets swapped out while a couple of jobs run. When it wakes up, it will get a callback for each new job, but when it installs the watch on lwj.N.state, it will get one callback, for the complete state. Or, depending on when it wakes up, it may see an arbitrary set of (contiguous) earlier state changes per job.

This is fundamentally racy. It feels like the KVS should be capable of giving you an interface that enables this to work without races. For example (brainstorming):

  • giving you a reference that includes the SHA1 for a point in the hierarchy at a specific point in time (e.g. the lwj directory SHA1 at the moment of your lwj.next-id watch callback). Think of this value as a snapshot.
  • the ability to traverse the hierarchy from a snapshot, e.g. kvsdir_get() except its kvsdir_t argument is no longer just a list of directory entry names; it's also a snapshot reference. The snapshot is effectively read-only so it cannot change out from underneath you.
  • a kvs_watch call that accepts a kvsdir_t argument (now a snapshot ref) and a key path relative to that directory.

Then when lwj.next-id increments you would install a watch on (ref-to-lwj-dir, N.state) and then be assured that all state transitions from that snapshot in time forward are captured.
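
As a strawman only, the brainstormed interface might read something like
the declarations below; none of these calls exist in the KVS today, and
all names are hypothetical (kvs_set_f stands for the existing watch
callback type):

    /* Return a read-only snapshot reference for a directory, anchored at
     * the SHA1 in effect at this instant. */
    int kvs_snapshot (flux_t h, const char *dir, kvsdir_t *snapp);

    /* Look up a key relative to a snapshot; the snapshot cannot change
     * underneath the caller. */
    int kvsdir_get_at (kvsdir_t snap, const char *key, json_object **valp);

    /* Watch a key relative to a snapshot; every transition from the
     * snapshot point forward is delivered to the callback. */
    int kvs_watch_at (flux_t h, kvsdir_t snap, const char *key,
                      kvs_set_f set, void *arg);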

The atomic namespace traversal idea is already captured in issue #64, which I'll mention here so this use case will be noted there.


garlick commented May 28, 2015

Erm. I think I went a little too far above. There is no record of the chain of SHA1 values taken by root, much less by arbitrary directories, so making that information available via kvs_watch() isn't currently possible. I opened issue #206.


dongahn commented May 28, 2015

@garlick:

Re: the problem description, you are correct.

Re: the brainstorming -- after we discussed this, I also thought about it a bit. The general concept here may be a generalized watch. The main race problem seems to emerge because the current watch only supports fine-grained installation -- i.e., having to watch each individual key.

Instead, if you enable a user to watch a point in the hierarchy, with the watcher invoked on any put event occurring in the subtree rooted at that point, you have a transactional watch.

Now, there are at least two problems.
One is that there could be lots of other puts that the user doesn't necessarily want to be notified of. So perhaps you may need to provide some idioms for the user to target only the "state changes of interest" (e.g., regular expressions of some sort?).

Second, when it comes down to a race, one might ask, "how do you guarantee the atomicity of the initial generalized watch installation?" This is the famous "turtles all the way down" problem, and we will need some basic atomic watch installation for the initial case. Alternatively, we can simply rely on the producers and consumers to synchronize with each other outside the KVS subsystem to guarantee the initial atomicity.

I would like to brainstorm further on the atomic namespace traversal through #64.

One thing the scheduler thrust may want to explore is revision handles that represent important resource allocation states.

This is again how one could implement the gaps_list idea using what we have in the KVS. A scheduler can do a whole bunch of puts and then get a revision handle that represents that resource state. Then, our gaps_list can be represented as a series of revision handles. I don't know whether this will scale in terms of performance, but it should scale in terms of memory use, I think.

I would like to take a bit deeper dive into this topic after @SteVwonder starts his internship.


garlick commented May 28, 2015

OK, let's continue this in #206 and #64. +1 on getting @SteVwonder involved. That'll be good timing, I think.


dongahn commented May 29, 2015

OK. The patch works around the race problem in the JSC tests. I've run the Travis tests 9 times and they all succeeded. Please merge.

garlick added a commit that referenced this pull request May 29, 2015
Integrate Job Status and Control (JSC) interface