4096 limit (per handle) on kvs_watch()? #802

Closed
grondo opened this issue Sep 9, 2016 · 10 comments
@grondo (Contributor) commented Sep 9, 2016

The following Lua code fails on recent master:

local f = assert (require 'flux' .new ())

local function errf (...)
    io.stderr:write (string.format (...))
    os.exit (1)
end

for i = 1, 5000 do
    local kw, err = f:kvswatcher {
            key = "x".."."..i,
            handler = function () end
    }
    if not kw then errf ("failed at i=%d: %s\n", i, err) end
end

Unfortunately, the failure is also silent on current master because the Lua bindings don't check return codes 👎. However, with the following patch

diff --git a/src/bindings/lua/flux-lua.c b/src/bindings/lua/flux-lua.c
index abd1291..41814ea 100644
--- a/src/bindings/lua/flux-lua.c
+++ b/src/bindings/lua/flux-lua.c
@@ -1073,6 +1073,7 @@ static int l_kvswatcher_remove (lua_State *L)

 static int l_kvswatcher_add (lua_State *L)
 {
+    int rc;
     struct l_flux_ref *kw = NULL;
     flux_t f = lua_get_flux (L, 1);
     const char *key;
@@ -1096,9 +1097,11 @@ static int l_kvswatcher_add (lua_State *L)
     kw = l_flux_ref_create (L, f, 2, "kvswatcher");
     lua_getfield (L, 2, "isdir");
     if (lua_toboolean (L, -1))
-        kvs_watch_dir (f, l_kvsdir_watcher, (void *) kw, key);
+        rc = kvs_watch_dir (f, l_kvsdir_watcher, (void *) kw, key);
     else
-        kvs_watch (f, key, l_kvswatcher, (void *) kw);
+        rc = kvs_watch (f, key, l_kvswatcher, (void *) kw);
+    if (rc < 0)
+        return lua_pusherror (L, strerror (errno));
     /*
      *  Return kvswatcher object to caller
      */

I consistently get this result:

$ lua test.lua
failed at i=4096: Resource temporarily unavailable

I did track this down to matchtag exhaustion, but I'm not sure there was a 4K per-handle limit on watchers before?

I do notice that watch_rpc() calls flux_matchtag_alloc() with FLUX_MATCHTAG_GROUP. Removing this flag obviously allows up to 64K watches to be installed, but I'm assuming it will break something else?

In any event, the net result of this is that wreckrun is currently limited to about 2048 tasks in the default mode of "watching" kz stdout/stderr entries for every task. In current master the problem appears as a hang due to the lack of return code checking (sorry!); that will be a separate fix.

Even with a 64K matchtag pool, flux-wreckrun is limited (in the current design) to 32K tasks. I'm not sure that would necessarily be a problem; we need to fix the IO for jobs anyway....
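As a side note, the exhaustion can be demonstrated at the C level without going through the Lua bindings. The following is an untested sketch that only leans on flux_matchtag_alloc(), FLUX_MATCHTAG_GROUP, and FLUX_MATCHTAG_NONE as they appear in the patches above; it assumes a handle `h` opened elsewhere and that the flux core header provides these symbols:

```c
#include <stdio.h>
#include <flux/core.h>   /* assumed header for flux_t and the matchtag API */

/* Sketch only: drain the FLUX_MATCHTAG_GROUP pool on handle `h` and report
 * how many tags it handed out before running dry (expected: 4096). */
static void drain_group_pool (flux_t h)
{
    int count = 0;

    while (flux_matchtag_alloc (h, FLUX_MATCHTAG_GROUP) != FLUX_MATCHTAG_NONE)
        count++;
    fprintf (stderr, "group matchtag pool exhausted after %d tags\n", count);
}
```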

@garlick (Member) commented Sep 10, 2016

I think you answered your own question, but to confirm: each kvs_watch does use a matchtag from the group pool, which is limited to 4K entries. I believe it would be safe to use the regular 64K pool for this, as the lower bits of the group tag are always zero for kvs.watch responses. It would also be reasonable to have the matchtag pools grow dynamically, since response matching is no longer a performance issue.

I wonder if there is a way we could convert all those watchers into one. I had already been thinking about including a list of modified keys in the setroot event. If a watcher could be registered on a glob...

@grondo (Contributor, Author) commented Sep 10, 2016

I wonder if there is a way we could convert all those watchers into one. I had already been thinking about including a list of modified keys in the setroot event. If a watcher could be registered on a glob...

That is a good idea, and I was thinking the exact same thing (register watches with a glob).
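To make the idea concrete, here is a purely hypothetical sketch of the consumer side: if the setroot event carried the list of modified keys, a single handler could fan changes out to glob-registered watchers with plain fnmatch(3). The structures, names, and event payload below are invented for illustration and are not an existing Flux API:

```c
#include <fnmatch.h>

/* Hypothetical registration entry: a glob pattern plus a user callback. */
struct glob_watcher {
    const char *pattern;                          /* e.g. "lwj.1.*.stdout" */
    void (*cb) (const char *key, void *arg);
    void *arg;
};

/* Hypothetical fan-out: called once per setroot event with the list of
 * keys that the commit modified. */
static void dispatch_changed_keys (struct glob_watcher *watchers, int nwatchers,
                                   const char **keys, int nkeys)
{
    for (int i = 0; i < nkeys; i++)
        for (int j = 0; j < nwatchers; j++)
            if (fnmatch (watchers[j].pattern, keys[i], 0) == 0)
                watchers[j].cb (keys[i], watchers[j].arg);
}
```

With something like this, one matchtag (for the event subscription) would cover an arbitrary number of watched keys instead of one tag per kvs_watch().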

@grondo (Contributor, Author) commented Sep 10, 2016

I believe it would be safe to use the regular 64K pool for this, as the lower bits of the group tag are always zero for kvs.watch responses.

Can you explain a bit more? I apologize; I admit I didn't understand exactly why the group pool was used for kvs_watch here. Is it OK for now to remove that flag under kz_set_ready_cb specifically?

@garlick (Member) commented Sep 10, 2016

Can you explain a bit more? I apologize; I admit I didn't understand exactly why the group pool was used for kvs_watch here. Is it OK for now to remove that flag under kz_set_ready_cb specifically?

The group pool is described in RFC 6 and is used by flux_rpc_multi(). It allows a message handler to be registered for a block of 32-bit matchtags, and for the lower bits to contain some useful information like the responding nodeid. Honestly, I can't remember why I thought to use it with kvs_watch(), other than that I wanted to cover that use case with something like flux_rpc_multi() and never got there.
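To make the "lower bits" remark concrete, here is a purely illustrative encoding. The real field widths and semantics are defined in RFC 6; the shift below is a made-up placeholder, not the protocol constant:

```c
#include <stdint.h>

/* Illustrative only: assume the upper bits of a group matchtag identify the
 * allocated block and the lower bits carry the responding nodeid.
 * GROUP_SHIFT is a placeholder, not the RFC 6 value. */
#define GROUP_SHIFT 20

static uint32_t group_tag_encode (uint32_t block, uint32_t nodeid)
{
    return (block << GROUP_SHIFT) | nodeid;
}

static uint32_t group_tag_block (uint32_t tag)
{
    return tag >> GROUP_SHIFT;
}

static uint32_t group_tag_nodeid (uint32_t tag)
{
    return tag & ((1u << GROUP_SHIFT) - 1);
}
```

Since kvs.watch responses never carry anything in those lower bits, a tag drawn from the regular pool matches them just as well, which is why dropping FLUX_MATCHTAG_GROUP in watch_rpc() is expected to be safe.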

I think you could simply make this change and there would be no problem (I verified make check still passes):

diff --git a/src/modules/kvs/libkvs.c b/src/modules/kvs/libkvs.c
index 1e54496..06ff142 100644
--- a/src/modules/kvs/libkvs.c
+++ b/src/modules/kvs/libkvs.c
@@ -796,7 +796,7 @@ static int watch_rpc (flux_t h, const char *key, JSON *val,
     /* Send the request.
      */
     assert (once || matchtag != NULL);
-    match.matchtag = flux_matchtag_alloc (h, FLUX_MATCHTAG_GROUP);
+    match.matchtag = flux_matchtag_alloc (h, 0);
     if (match.matchtag == FLUX_MATCHTAG_NONE) {
         errno = EAGAIN;
         goto done;

@grondo (Contributor, Author) commented Sep 10, 2016

Thanks @garlick! That is what I did in my test directory and also noted that make check passes. I just wasn't sure if there was some untested corner case I was missing. Thanks for the explanation too.

@trws (Member) commented Sep 12, 2016

Some discussion came up WRT this in the meeting this morning. Is there a reason to have a hard cap on the number of items in the group? One of the reasons given was that the veb trees are pre-sized. We talked a while ago, in #474 I think, about pulling in Judy arrays to use as associative arrays (maybe with the c99 bindings header, since their interface is somewhat baroque) to replace some of the zhash/zlist usage, and it occurs to me that the base data structure (J1 rather than JL) is a dynamic bit-array of size 2^<word-size>. It's probably not quite as fast as a veb, but it might make it easier to deal with spikes in matchtag usage. Thoughts @garlick?

@garlick (Member) commented Sep 12, 2016

Growing the veb pool as it gets full could be done if we want (up to max bits allowed in the protocol). My only hesitation in doing so is that there may be diminishing returns, given that there will still be a hard limit, and it seems like that many outstanding RPCs should tell us that we're doing something wrong.

I did an experiment with JudyL in dispatch.c when I implemented the "response fastpath". I don't recall all the details, but the current setup was the winner: veb for the tag pool, table lookup for response matching. I remember something about mallocs in the critical path being a worry with Judy, although I don't know if I followed that up rigorously.

Anyway, happy to do something like in nodeset.c to dynamically resize the tag pool if we want that.
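Purely for illustration, the kind of resizing being discussed could look like the toy allocator below. It uses a byte-per-tag array rather than the actual veb tree or nodeset.c code, and none of the names correspond to real libflux symbols:

```c
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

/* Toy tag pool that doubles when it fills up, capped at a hard protocol
 * limit.  Assumes size > 0 at initialization. */
struct tagpool {
    uint8_t *used;      /* one byte per tag, for simplicity */
    uint32_t size;      /* current pool size */
    uint32_t max;       /* hard limit the protocol can represent */
};

static int tagpool_alloc (struct tagpool *p, uint32_t *tagp)
{
    for (uint32_t i = 0; i < p->size; i++) {
        if (!p->used[i]) {
            p->used[i] = 1;
            *tagp = i;
            return 0;
        }
    }
    if (p->size >= p->max)
        return -1;                              /* truly exhausted */
    uint32_t oldsize = p->size;
    uint32_t newsize = oldsize * 2 > p->max ? p->max : oldsize * 2;
    uint8_t *newused = realloc (p->used, newsize);
    if (!newused)
        return -1;
    memset (newused + oldsize, 0, newsize - oldsize);
    p->used = newused;
    p->size = newsize;
    p->used[oldsize] = 1;                       /* first tag in the new half */
    *tagp = oldsize;
    return 0;
}
```

The linear scan is obviously not what you'd ship; the point is only the doubling-on-exhaustion behavior and the hard cap, which lets the default pool start small.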

@grondo (Contributor, Author) commented Sep 12, 2016

@garlick, we did bring up the exact same hesitations as you in the meeting; however, @trws made the good point that a warning and slow performance might be preferable to hard failures. Perhaps we can devise a way to keep the current limits but issue some sort of warning when matchtags are nearly exhausted (I have no idea of a good way to do that, though).
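One rough possibility, purely a sketch with invented names and an arbitrary threshold, would be a one-shot high-water-mark check inside the allocator:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical one-shot warning when a tag pool crosses a high-water mark.
 * `avail` and `size` would come from whatever accounting the real pool
 * keeps; the 75% threshold is arbitrary. */
static void tagpool_check_watermark (uint32_t avail, uint32_t size)
{
    static bool warned = false;

    if (!warned && size > 0 && avail < size / 4) {
        fprintf (stderr, "warning: matchtag pool is %u%% full\n",
                 (unsigned) (100 * (size - avail) / size));
        warned = true;
    }
}
```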

@garlick (Member) commented Sep 12, 2016

I'll go ahead and submit a quick PR to allow the tag pool to grow, since that seems to be the consensus and I don't feel too strongly the other way. On the plus side, we can probably then reduce the default size and save a few kilobytes per handle.

garlick added a commit to garlick/flux-core that referenced this issue Sep 13, 2016
There is no need for kvs_watch() to use matchtags in
the FLUX_MATCHTAG_GROUP range, since the lower bits aren't
being used for anything, and this tag pool is limited in
size, which can limit scalability of applications that
use many kvs_watch() calls.

Partially fixes flux-framework#802
@garlick (Member) commented Sep 13, 2016

I'm not sure what has changed, but in current master, switching kvs_watch() off the group tag pool causes test failures in kvs-basic and wreck.

I tracked down the kvs-basic one: it is the "watch unwatch" test. What seems to be happening is that "unwatch" causes the "watch" tag to go back into the pool before all the watch responses have been processed. The tag then gets reused by a commit, and the watch response gets confused with the commit response (protocol error).

I'm trying to find out why this is suddenly failing when it wasn't before. Perhaps it is a regression from the recently merged kvs commit changes, only unmasked by the proposed watch tag pool change.
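For illustration, the race is roughly: the client frees the watch matchtag as soon as kvs.unwatch is issued, a later kvs.commit is assigned the same tag, and a stale watch response still in flight then matches the commit request. A hedged sketch of the safer ordering (free only once the final response has been consumed), with invented names rather than the actual libkvs code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative ordering only; not the actual libkvs code.  The tag must
 * stay allocated until the last response that can carry it has been
 * consumed, not merely until the unwatch request has been sent. */
struct watch {
    uint32_t matchtag;      /* allocated when the watch was installed */
    bool unwatch_sent;      /* kvs.unwatch request has gone out */
    bool finished;          /* final response for this tag consumed */
};

static bool watch_tag_can_be_freed (const struct watch *w)
{
    /* Freeing on unwatch_sent alone lets a subsequent commit reuse the tag
     * while watch responses are still in flight, which is exactly the
     * protocol-error confusion described above. */
    return w->unwatch_sent && w->finished;
}
```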
