
Integrate Job Status and Control (JSC) interface #205

Merged
merged 8 commits into from
May 29, 2015

Conversation


dongahn commented May 26, 2015

Provide a high-level abstraction for monitoring and controlling the
status of Flux jobs. The interface is designed to expose job status and
control while hiding the underlying layout of job information stored
within Flux's KVS data store. Schedulers and runtime tools are expected
to be its main users. This abstraction gives the producers of job
information, including task and program execution service modules such
as wreckrun, an opportunity to change and optimize the data layout of
jobs within the KVS without major impact on the implementation of the
schedulers and runtime tools.

The main design considerations are the following:
  • Abstract out implementation-dependent job states;
  • Provide flexible and easily extensible mechanisms to pass job info;
  • Use a minimalistic API set.
Build on the Job Control Block (JCB):

Our data schema containing the information needed to manage a
particular job, such as the jobid, the resources owned by the job, and
the processes spawned by the job. JSC converts the raw information on a
job into a JCB, implemented as a JSON dictionary object. The current
JCB structure is described in tables in README.md.

Provide three main JSC calls (a usage sketch follows this list):
  • jsc_notify_status (): register a callback with the asynchronous status-change notification service; the callback will be invoked whenever the state of a job changes;
  • jsc_query_jcb (): query a key attribute of the JCB of jobid;
  • jsc_update_jcb (): update a key attribute within the JCB of jobid.
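
For reference, a minimal sketch of how a consumer might register with the
notification service; the handle argument, header names, and the
flux_open()/flux_close() connection calls are assumptions here, not taken
from this PR:

    #include <stdio.h>
    #include <json/json.h>      /* json-c; include path may differ */
    #include <flux/core.h>
    #include "jstatctl.h"       /* assumed JSC header name */

    /* Invoked whenever a job's state changes; base_jcb is a JCB fragment
     * (a JSON dictionary) carrying at least the jobid and the new state. */
    static int job_status_cb (json_object *base_jcb, void *arg, int errnum)
    {
        if (errnum == 0)
            printf ("JCB update: %s\n",
                    json_object_to_json_string (base_jcb));
        return 0;
    }

    int main (void)
    {
        flux_t h = flux_open (NULL, 0);     /* connect to the broker */
        if (!h)
            return 1;
        if (jsc_notify_status (h, job_status_cb, NULL) < 0)
            return 1;
        /* ... enter the flux reactor loop here so callbacks are delivered;
         *     jsc_query_jcb ()/jsc_update_jcb () can be used for one-shot
         *     reads/writes of a JCB attribute ... */
        flux_close (h);
        return 0;
    }
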
Provide a standalone utility and testers:

Created a utility command to facilitate testing: flux-jstat. Its
usage is the following:

Usage:
flux-jstat notify
flux-jstat query jobid
flux-jstat update jobid

flux-core/t/t2001-jsc.t also contains various test cases that use
this utility.

@garlick garlick added the review label May 26, 2015
@dongahn dongahn closed this May 26, 2015
@garlick garlick removed the review label May 26, 2015
@dongahn dongahn reopened this May 26, 2015
@garlick garlick added the review label May 26, 2015

garlick commented May 27, 2015

I'd like to merge this so @dongahn can get on with his next steps, and so he will not have to track the flux API as we change it. Any input from @lipari or @grondo, since this will be a bridge between the scheduler and wreckrun (or its replacement)?


lipari commented May 27, 2015

I have reviewed the code and have no concerns. It's good to go as far as I'm concerned.


grondo commented May 27, 2015

This looks fine to me!

I have a couple very minor style comments (questions, really), that should probably not hold up the merge:

  • In the README.md there are some very long lines. For consistency, I would prefer these be wrapped at column 78 or so. This is not only to ease editing in a console, but also will improve line-based diff output of future changes. (I know, I do see some tables that you won't be able to do anything about).

  • The typedef function pointer JSC_CB_PTR doesn't follow our unwritten naming convention for function types. I'm not sure how strongly I feel about this, but my preference would be to only capitalize CPP macros to avoid confusion. Also maybe follow the convention used in the Flux reactor and make the type a pointer to the callback function as seen with

    typedef int (*FluxMsgHandler)(flux_t h, int typemask, zmsg_t **zmsg, void *arg);

    e.g. for JSC_CB_PTR you would perhaps have

    typedef int (*JSCHandler)(json_object *base_jcb, void *arg, int errnum);

    It would help users of the API I think to have some common convention for the callback typedefs. (It doesn't have to be this convention here, maybe that is a separate discussion)


dongahn commented May 27, 2015

All: thank you for the reviews! Sounds all good. I will make the changes @grondo suggested and commit them to the jsc_topic branch so that they will appear in this PR.


garlick commented May 27, 2015

I think somebody called me on the "studly caps" in those reactor callback typedefs, so I have been switching to this form: typedef int (*foo_whatever_f)(args). I guess we should pick a style and update RFC 7.


grondo commented May 27, 2015

I definitely prefer

typedef int (*foo_whatever_f)(args)

The trailing _f calls out that this is a function pointer type.
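
Applied to the JSC callback discussed earlier, that convention might read
as follows (the name below is only an illustration, not a decided-on API):

    typedef int (*jsc_handler_f)(json_object *base_jcb, void *arg, int errnum);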

dongahn added 2 commits May 27, 2015 19:25
Apparently, there is no markdown construct that allows table rows to
wrap, so some lines corresponding to table rows still remain long.

dongahn commented May 28, 2015

Hmm, Travis is not my friend... (or it is :-) I will have to look at the failure, but it could be that run_timeout 2 is not long enough for wreckrun to run 4-node jobs. A big question is how to reproduce the same issue in our local development environment...

Any suggestions?


garlick commented May 28, 2015

Not sure if it will be of help here, but @grondo provided a Dockerfile in PR #191 that mimics the Travis environment.


dongahn commented May 28, 2015

@garlick and @grondo: Dockerfile support should be useful!

I looked at the Travis log files and thought about this problem a bit.
My current suspicion is that this is due to the producer-consumer
synchronization issue I documented in README.md:

JCB producer-consumer synchronization -- currently there is no
built-in synchronization between JCB producers and consumers, and thus a
race condition can occur. When the remote parallel execution changes
the state of a job, the registered callbacks will be invoked.
However, when one of the invoked callbacks is trying to read a JCB
attribute, nothing prevents the remote execution from modifying the
same JCB attribute! Because producers and consumers use the KVS like a
distributed shared memory, one must devise ways to guarantee
synchronization. One solution is for the producers to also use the JSC
API and for us to build some synchronization primitives into this API.
But for now, we ignore these synchronization issues.

In a nutshell, the JSC notification service installs a job-state watcher (e.g., on lwj.5.state) when it gets notified of a new job directory creation event (e.g., lwj.5) from the KVS, and it expects the installed watcher to be invoked immediately with ENOENT to begin with.

However, on a failed Travis run, this job-state watcher actually gets invoked with the "reserved" state, which then seems to mess up the expected job-state change sequence.

I suspect this is caused by a race between wreckrun storing a new state ("reserved") and JSC installing the state watcher. Depending on who wins the race, the behavior changes.

I think I can break this race by posting the initial new-job-creation event (J_NULL) to the JSC users before installing the state watcher. This way, JSC's new-job-creation event and wreckrun's subsequent state-change events will form a clear happens-before relationship.
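
Roughly, the fix amounts to swapping two steps in JSC's handling of a new
job; the helper names below are hypothetical stand-ins for illustration,
not the actual patch:

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical stand-ins for JSC internals, for illustration only. */
    static void post_event_to_users (int64_t jobid, const char *state)
    {
        printf ("job %lld: %s\n", (long long)jobid, state);
    }
    static void install_state_watcher (int64_t jobid)
    {
        printf ("watching lwj.%lld.state\n", (long long)jobid);
    }

    /* The fix: post the new-job-creation event (J_NULL) to JSC users
     * before installing the state watcher, so every state change the
     * watcher later reports is ordered after the J_NULL event. */
    static void on_new_job (int64_t jobid)
    {
        post_event_to_users (jobid, "J_NULL");
        install_state_watcher (jobid);
    }

    int main (void)
    {
        on_new_job (5);
        return 0;
    }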

However, by doing this, there will be a larger blind window during which JSC won't be able to watch the job state changes, making the JSC notification service imprecise.

I think we should revisit this issue (the need for efficient producer-consumer synchronization) in the context of @grondo's task/program execution service.

I also had some discussions this morning with @garlick, and perhaps efficient synchronization primitives exposed by the KVS can help with such synchronization. So I will file an issue ticket for this.

At any rate, I will run this patch through Travis several times to make sure this doesn't produce any other side effects before applying it to this PR.


garlick commented May 28, 2015

To make sure I understand:

  1. libjsc calls kvs_watch on lwj.next-id. When wreckrun launches a job, it increments the value; libjsc gets a callback and reads the new id.
  2. the callback then kvs_watches lwj.N.state and gets a callback on each state transition for the job.
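
A sketch of that two-level pattern, assuming the 2015-era typed
kvs_watch_int64()/kvs_watch_string() variants and their callback
signatures (those details are assumptions here; error handling omitted):

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>
    #include <flux/core.h>
    #include "kvs.h"   /* assumed header for the libkvs watch API */

    /* Level 2: fires on each transition of lwj.N.state for one job. */
    static int job_state_cb (const char *key, const char *state,
                             void *arg, int errnum)
    {
        if (errnum == 0)
            printf ("%s = %s\n", key, state);
        return 0;
    }

    /* Level 1: fires when wreckrun bumps lwj.next-id; install a watch
     * on the state of the job directory that was just created. */
    static int next_id_cb (const char *key, int64_t next_id,
                           void *arg, int errnum)
    {
        flux_t h = arg;
        char state_key[64];

        if (errnum != 0)
            return 0;
        snprintf (state_key, sizeof (state_key),
                  "lwj.%" PRId64 ".state", next_id - 1);
        return kvs_watch_string (h, state_key, job_state_cb, h);
    }

    /* At startup: kvs_watch_int64 (h, "lwj.next-id", next_id_cb, h); */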

Say the libjsc user gets swapped out while a couple of jobs run. When it wakes up, it will get a callback for each new job, but when it installs the watch on lwj.N.state, it will get one callback, for the complete state. Or, depending on when it wakes up, it may see an arbitrary set of (contiguous) earlier state changes per job.

This is fundamentally racy. It feels like the KVS should be capable of giving you an interface that enables this to work without races. For example (brainstorming):

  • giving you a reference that includes the SHA1 for a point in the hierarchy at a specific point in time (e.g. the lwj directory SHA1 at the moment of your lwj.next-id watch callback). Think of this value as a snapshot.
  • the ability to traverse the hierarchy from a snapshot, e.g. kvsdir_get() except its kvsdir_t argument is no longer just a list of directory entry names; it's also a snapshot reference. The snapshot is effectively read-only so it cannot change out from underneath you.
  • a kvs_watch call that accepts a kvsdir_t argument (now a snapshot ref) and a key path relative to that directory.

Then when lwj.next-id increments you would install a watch on (ref-to-lwj-dir, N.state) and then be assured that all state transitions from that snapshot in time forward are captured.
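
As a strawman only, the brainstormed interface might read something like
the declarations below; none of these calls exist in the KVS today, and
all names are hypothetical (kvs_set_f stands for the existing watch
callback type):

    /* Return a read-only snapshot reference for a directory, anchored at
     * the SHA1 in effect at this instant. */
    int kvs_snapshot (flux_t h, const char *dir, kvsdir_t *snapp);

    /* Look up a key relative to a snapshot; the snapshot cannot change
     * underneath the caller. */
    int kvsdir_get_at (kvsdir_t snap, const char *key, json_object **valp);

    /* Watch a key relative to a snapshot; every transition from the
     * snapshot point forward is delivered to the callback. */
    int kvs_watch_at (flux_t h, kvsdir_t snap, const char *key,
                      kvs_set_f set, void *arg);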

The atomic namespace traversal idea is already captured in issue #64, which I'll mention here so this use case will be noted there.


garlick commented May 28, 2015

Erm. I think I went a little too far above. There is no record of the chain of SHA1 values taken by root, much less by arbitrary directories, so making that information available via kvs_watch() isn't currently possible. I opened issue #206.


dongahn commented May 28, 2015

@garlick:

Re: the problem description, you are correct.

Re: the brainstorming -- after we discussed this, I also thought about it a bit. The general concept here may be a generalized watch. The main race problem seems to emerge because the current watch only supports fine-grained installation -- i.e., having to watch each individual key.

Instead, if you enable a user to watch a point in the hierarchy, with the watcher invoked on any put event occurring in the subtree rooted at that point, you have a transactional watch.

Now, there are at least two problems.
One is that there could be lots of other puts that the user doesn't necessarily want to be notified of. So perhaps you may need to provide some idioms for the user to target only the "state changes of interest" (e.g., regular expressions of some sort?).

Second, when it comes down to a race, one might ask, "how do you guarantee the atomicity of the initial generalized watch installation?" This is the famous "turtles all the way down" problem, and we will need some basic atomic watch installation for the initial case. Alternatively, we can simply rely on the producers and consumers to synchronize with each other outside the KVS subsystem to guarantee the initial atomicity.

I would like to brainstorm further on the atomic namespace traversal through #64.

One thing the scheduler thrust may want to explore is revision handles that represent important resource allocation states.

This is again how one could implement the gaps_list idea using what we have in the KVS. A scheduler can do a whole bunch of puts and then get a revision handle that represents that resource state. Then, our gaps_list can be represented as a series of revision handles. I don't know whether this will scale in terms of performance, but it should scale in terms of memory use, I think.

I would like to take a bit deeper dive into this topic after @SteVwonder starts his internship.


garlick commented May 28, 2015

OK, let's continue this in #206 and #64. +1 on getting @SteVwonder involved. That'll be good timing, I think.


dongahn commented May 29, 2015

OK. The patch works around the race problem in the JSC tests. I've run the Travis tests 9 times and they all succeeded. Please merge.

garlick added a commit that referenced this pull request May 29, 2015
Integrate Job Status and Control (JSC) interface