Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

a kvs watch handler cannot modify the key it watches #81

Closed
garlick opened this issue Oct 22, 2014 · 3 comments · Fixed by #85
Closed

a kvs watch handler cannot modify the key it watches #81

garlick opened this issue Oct 22, 2014 · 3 comments · Fixed by #85

Comments

@garlick
Copy link
Member

garlick commented Oct 22, 2014

If a synchronous RPC is made within a kvs watch callback, and another kvs.watch response is received before the expected response to the RPC, it is put in a queue that will be processed the next time the handle blocks on a response. It will not, however, cause the reactor to become ready when it is re-entered following the watch callback.

@garlick
Copy link
Member Author

garlick commented Oct 22, 2014

Example code (minus error handling)

watch_cb() should be called twice, then flux_reactor_start() should return; however, watch_cb() is only called once, and the reactor loop is re-entered with the second watch response in a queue that cannot make the reactor ready.

int watch_cb (const char *key, int val, void *arg, int errnum)
{
    flux_t h = arg;
    kvs_put_int (h, key, val + 1);
    kvs_commit (h);
    return (val == 0 ? -1 : 0);
}

int main (int argc, char *argv[])
{
    flux_t h = flux_api_open ();
    kvs_put_int (h, "foo", -1);
    kvs_commit (h);
    kvs_watch_int (h, "foo", watch_cb, h);
    flux_reactor_start (h);
}

@garlick
Copy link
Member Author

garlick commented Oct 23, 2014

Incidentally I think this is the same bug that @lipari hit in the scheduler a while back.

garlick added a commit to garlick/flux-core that referenced this issue Oct 23, 2014
This allows our deferred queue to make the reactor ready and
should address issue flux-framework#81 for api sockets.
@lipari
Copy link
Contributor

lipari commented Oct 23, 2014

I've kept all the dialog regarding the problem which I can resurrect for the curious. The solution was to revamp the interaction with the kvs quite significantly. Here's a summary of what the original problem was:

The schedsvr_main first waits for a lwj to appear in the kvs. Once the first job is submitted, it proceeds with its initialization. That initialization includes calling kvs_put_int64 with an initial value for "event-counter". That call is followed by kvs_commit().

In the meantime, wreck/jobsrv.c is creating lwj.1 with its state of "reserved".

When the schedsvr invokes its kvs_commit(), the { "lwj.1.state": "reserved" } message gets added to p->deferred_responses based on this stack trace (which includes an abort I added for diagnostic purposes):

#0 0x00002aaaabe60925 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00002aaaabe62105 in abort () at abort.c:92
#2 0x0000000000408694 in plugin_response_putmsg (impl=0x6de3e0, zmsg=0x2aaaf4408bd0) at plugin.c:138
#3 0x000000000040ebbf in flux_response_putmsg (h=0x6de620, zmsg=0x2aaaf4408bd0) at handle.c:154
#4 0x0000000000410395 in flux_response_matched_recvmsg (h=0x6de620, match=0x2aaaf87041b0 "kvs.commit", nb=false)
at handle.c:781
#5 0x00000000004105a7 in flux_rpc (h=0x6de620, request=0x2aaaf880cdf0, fmt=0x41ff39 "kvs.commit") at handle.c:810
#6 0x000000000040c5bf in kvs_commit (h=0x6de620) at kvscli.c:983
#7 0x00002aaaf4004767 in reg_event_hdlr (func=0x2aaaf4004ca1 <event_cb>) at wreck/schedsrv.c:912
#8 0x00002aaaf4004eea in schedsvr_main (p=0x6de620, args=0x6293f0) at wreck/schedsrv.c:1117
#9 0x0000000000409a05 in plugin_thread (arg=0x6de3e0) at plugin.c:563
#10 0x00002aaaabc179d1 in start_thread (arg=0x2aaaf4409700) at pthread_create.c:301
#11 0x00002aaaabf16b5d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

So, it looks like there is some cross talk happening.

In the meantime, I'm going to move the initialization steps to happen before the wait for the first job and see whether that avoids the problem.

Don

garlick added a commit to garlick/flux-core that referenced this issue Oct 23, 2014
    This allows deferred messages to make the reactor ready and
    should address issue flux-framework#81 for module handles.
garlick added a commit to garlick/flux-core that referenced this issue Oct 23, 2014
This allows deferred messages to make the reactor ready and
should address issue flux-framework#81 for module handles.
garlick added a commit to garlick/flux-core that referenced this issue Oct 23, 2014
grondo added a commit that referenced this issue Oct 24, 2014
handle deferrred messages with an inproc socket.
Fixes #81.
grondo pushed a commit to grondo/flux-core that referenced this issue Oct 31, 2014
This allows our deferred queue to make the reactor ready and
should address issue flux-framework#81 for api sockets.
grondo pushed a commit to grondo/flux-core that referenced this issue Oct 31, 2014
This allows deferred messages to make the reactor ready and
should address issue flux-framework#81 for module handles.
grondo pushed a commit to grondo/flux-core that referenced this issue Oct 31, 2014
grondo pushed a commit to grondo/flux-core that referenced this issue Dec 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants