4096 limit (per handle) on kvs_watch()? #802
I think you answered your own question, but to confirm: each kvs_watch does use a matchtag from the group pool, which is limited to 4K entries. I believe it would be safe to use the regular 64K pool for this, as the lower bits of the group tag are always zero for kvs.watch responses. It would also be reasonable to have the matchtag pools grow dynamically, since response matching is no longer a performance issue. I wonder if there is a way we could convert all those watchers into one. I had already been thinking about including a list of modified keys in the setroot event. If a watcher could be registered on a glob...
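To make that layout concrete, here is a minimal sketch of why a group tag's low bits stay zero, assuming a 32-bit matchtag split into a 12-bit group id and a 20-bit intra-group index (the shift and pool sizes are inferred from the 4K/64K figures above, not copied from flux-core):

```c
#include <stdint.h>
#include <stdio.h>

#define GROUP_SHIFT  20            /* assumed: low 20 bits index within a group */
#define GROUP_POOL   (1 << 12)     /* 4K group tags */
#define REGULAR_POOL (1 << 16)     /* 64K regular tags */

int main (void)
{
    uint32_t id = 42;                          /* hypothetical group id */
    uint32_t tag = id << GROUP_SHIFT;          /* low 20 bits are zero */

    /* A kvs.watch response never sets the low bits, so a tag drawn
     * from the regular 64K pool would match just as well. */
    printf ("group tag %u = 0x%08x, low bits = %u\n",
            id, tag, tag & ((1u << GROUP_SHIFT) - 1));
    return 0;
}
```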
That is a good idea, and I was thinking the exact same thing (register watches with a glob).
Can you explain a bit more? I apologize; I admit I didn't understand why exactly the group pool was used for kvs_watch here. Is it ok for now to remove that flag?
The group pool is described in RFC 6. I think you could simply make this change and there would be no problem (I verified make check still passes):

```diff
diff --git a/src/modules/kvs/libkvs.c b/src/modules/kvs/libkvs.c
index 1e54496..06ff142 100644
--- a/src/modules/kvs/libkvs.c
+++ b/src/modules/kvs/libkvs.c
@@ -796,7 +796,7 @@ static int watch_rpc (flux_t h, const char *key, JSON *val,
     /* Send the request.
      */
     assert (once || matchtag != NULL);
-    match.matchtag = flux_matchtag_alloc (h, FLUX_MATCHTAG_GROUP);
+    match.matchtag = flux_matchtag_alloc (h, 0);
     if (match.matchtag == FLUX_MATCHTAG_NONE) {
         errno = EAGAIN;
         goto done;
```
Thanks @garlick! That is what I did in my test directory as well.
Some discussion came up WRT this in the meeting this morning. Is there a reason to have a hard cap on the number of items in the group? One of the reasons given was that the veb trees are pre-sized. We talked a while ago, in #474 I think, about pulling in Judy arrays to use as associative arrays (maybe with the c99 bindings header, since their interface is somewhat baroque) to replace some of the zhash/zlist usage, and it occurs to me that the base data structure (J1 rather than JL) is a dynamic bit-array, so it would not need to be pre-sized; a rough sketch of the idea is below.
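For illustration only, a toy tag allocator over Judy1, using the first-empty lookup to find a free tag. This assumes libjudy's Judy.h macros (J1FE, J1S, J1U, J1FA) and is not proposed flux-core code:

```c
#include <Judy.h>
#include <stdio.h>

int main (void)
{
    Pvoid_t pool = (Pvoid_t) NULL;   /* set bit == tag in use */
    int rc;
    Word_t tag, bytes;

    /* allocate a few tags: find the first clear bit, then set it */
    for (int i = 0; i < 3; i++) {
        tag = 0;
        J1FE (rc, pool, tag);        /* first empty index >= tag */
        J1S (rc, pool, tag);         /* mark in use */
        printf ("allocated tag %lu\n", (unsigned long) tag);
    }

    tag = 1;
    J1U (rc, pool, tag);             /* free tag 1; next alloc reuses it */

    J1FA (bytes, pool);              /* discard the whole array */
    return 0;
}
```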
Growing the veb pool as it gets full could be done if we want (up to max bits allowed in the protocol). My only hesitation in doing so is that there may be diminishing returns, given that there will still be a hard limit, and it seems like that many outstanding RPCs should tell us that we're doing something wrong. I did an experiment with JudyL in dispatch.c when I implemented the "response fastpath". I don't recall all the details, but the current setup was the winner: veb for tag pool, table lookup for response matching. I remember something about mallocs in the critical path being a worry with Judy, although I don't know if I followed that up rigorously. Anyway, happy to do something like in nodeset.c to dynamically resize the tag pool if we want that; the doubling approach is sketched below.
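A minimal sketch of that doubling approach, with a plain bitmap standing in for the veb tree (tagpool_t and friends are invented names, not flux-core API):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uint8_t *used;      /* one bit per tag */
    uint32_t size;      /* current capacity in tags (multiple of 8) */
    uint32_t max;       /* hard limit from the wire protocol */
} tagpool_t;

/* double the bitmap, up to the protocol's hard limit */
static int tagpool_grow (tagpool_t *p)
{
    uint32_t nsize = p->size * 2;
    uint8_t *nused;
    if (nsize > p->max)
        return -1;                          /* protocol limit reached */
    if (!(nused = realloc (p->used, nsize / 8)))
        return -1;
    memset (nused + p->size / 8, 0, (nsize - p->size) / 8);
    p->used = nused;
    p->size = nsize;
    return 0;
}

/* find a clear bit and set it, growing the pool on exhaustion
 * (linear scan for brevity; the real pool uses a veb tree) */
static int64_t tagpool_alloc (tagpool_t *p)
{
    for (;;) {
        for (uint32_t i = 0; i < p->size; i++) {
            if (!(p->used[i >> 3] & (1 << (i & 7)))) {
                p->used[i >> 3] |= (1 << (i & 7));
                return i;
            }
        }
        if (tagpool_grow (p) < 0)
            return -1;                      /* truly exhausted */
    }
}
```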
@garlick, we did bring up the exact same hesitations as you in the meeting; however, @trws made the good point that a warning and slow performance might be preferable to hard failures. Perhaps we can devise a way to keep the current limits but issue some sort of warning when matchtags are near running out (I have no idea of a good way to do that though).
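One naive possibility, purely as a sketch (the function name and the 90% threshold are made up; a real version would rate-limit the message):

```c
#include <stdio.h>
#include <stdint.h>

/* warn when more than ~90% of tags are in use; 'used' and 'size'
 * would come from the real pool's bookkeeping */
static void tagpool_usage_check (uint32_t used, uint32_t size)
{
    if (used * 10 >= size * 9)
        fprintf (stderr, "warning: %u of %u matchtags allocated\n",
                 used, size);
}
```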
I'll go ahead and submit a quick PR to allow the tag pool to grow, since that seems to be the consensus and I don't feel too strongly the other way. On the plus side, we can probably then reduce the default size and save a few kilobytes per handle.
There is no need for kvs_watch() to use matchtags in the FLUX_MATCHTAG_GROUP range, since the lower bits aren't being used for anything, and this tag pool is limited in size, which can limit scalability of applications that use many kvs_watch() calls. Partially fixes flux-framework#802
I'm not sure what has changed, but in the current master, switching kvs watch off the group tagpool causes test failures in kvs-basic and wreck. I tracked down the kvs-basic one: it is the "watch unwatch" test. What seems to be happening is that "unwatch" causes the "watch" tag to go back in the pool before all the watch responses have been processed. The tag gets reused by a commit, and the watch response gets confused with the commit response (protocol error). I'm trying to find out why this is suddenly failing when it wasn't before. Perhaps a regression from recently merged kvs commit changes, only unmasked by the proposed watch tagpool change.
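To spell out one possible fix for that race, a sketch in which the tag is merely marked "retiring" at unwatch time and only returned to the pool once the final response carrying it has been consumed (all names here are hypothetical, not the actual kvs code):

```c
#include <stdbool.h>
#include <stdint.h>

struct watcher {
    uint32_t matchtag;
    bool retiring;      /* unwatch sent; final response still pending */
};

/* on kvs_unwatch(): do NOT free the matchtag yet */
static void watcher_unwatch (struct watcher *w)
{
    w->retiring = true;
}

/* on every response carrying w->matchtag; returns true when the
 * tag may safely be returned to the pool */
static bool watcher_response (struct watcher *w, bool is_final)
{
    if (w->retiring && is_final) {
        /* no further responses can arrive with this tag, so a
         * commit reusing it can no longer be confused with a watch */
        return true;
    }
    return false;
}
```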
The following Lua code fails on recent master:
Unfortunately, the failure is also silent on current master because the Lua bindings don't check return codes 👎. However with the following patch
I consistently get this result:
I did track this down to matchtag exhaustion, but I'm not sure there was a 4K per-handle limit on watchers before?
I do notice that watch_rpc() calls flux_matchtag_alloc() with FLUX_MATCHTAG_GROUP. Removing this flag obviously allows up to 64K watches to be installed, but I'm assuming it will break something else? In any event, the net result of this is that wreckrun is currently limited to about 2048 tasks in the default mode of "watching" kz stdout/stderr entries for every task (two watchers per task against the 4K group pool). In current master the problem appears as a hang due to lack of return code checking (sorry!); that will be a separate fix.
Even with the 64K matchtag pool, flux-wreckrun is limited (in the current design) to 32K tasks, since each task consumes two watchers (65536 / 2 = 32768). I'm not sure that would necessarily be a problem; we need to fix the IO for jobs anyway...