Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement KVS transaction object #1107

Merged
merged 18 commits into from
Jul 19, 2017
Merged

Conversation

garlick
Copy link
Member

@garlick garlick commented Jul 14, 2017

This PR implements the proposed flux_kvs_txn_t objects for the KVS discussed in #1094:

flux_kvs_txn_t *flux_kvs_txn_create (void);
void flux_kvs_txn_destroy (flux_kvs_txn_t *txn);

int flux_kvs_txn_put (flux_kvs_txn_t *txn, int flags,
                      const char *key, const char *json_str);

int flux_kvs_txn_pack (flux_kvs_txn_t *txn, int flags,
                       const char *key, const char *fmt, ...);

int flux_kvs_txn_link (flux_kvs_txn_t *txn, int flags,
                       const char *key, const char *json_str);

int flux_kvs_txn_mkdir (flux_kvs_txn_t *txn, int flags,
                        const char *key);

int flux_kvs_txn_unlink (flux_kvs_txn_t *txn, int flags,
                         const char *key);

int flux_kvs_txn_symlink (flux_kvs_txn_t *txn, int flags,
                          const char *key, const char *target);

flux_future_t *flux_kvs_commit (flux_t *h, int flags, flux_kvs_txn_t *txn);

flux_future_t *flux_kvs_fence (flux_t *h, int flags, const char *name,
                               int nprocs, flux_kvs_txn_t *txn);

As discussed in #1094 we'll want some convenience functions since having to create/put/commit/destroy to change a single value is a bit cumbersome. For now the old interfaces are still there, rewritten in terms of the above.

This PR also internally converts the KVS client API to use jansson.

@garlick garlick changed the title implement flux_kvs_txn_t implement KVS transaction object Jul 14, 2017
@codecov-io
Copy link

codecov-io commented Jul 14, 2017

Codecov Report

Merging #1107 into master will decrease coverage by 0.31%.
The diff coverage is 74.62%.

@@            Coverage Diff             @@
##           master    #1107      +/-   ##
==========================================
- Coverage   78.34%   78.03%   -0.32%     
==========================================
  Files         157      159       +2     
  Lines       26132    26232     +100     
==========================================
- Hits        20474    20469       -5     
- Misses       5658     5763     +105
Impacted Files Coverage Δ
src/common/libkvs/kvs.c 100% <ø> (+19.93%) ⬆️
src/modules/kvs/fence.c 85.24% <100%> (-0.24%) ⬇️
src/modules/kvs/commit.c 87.65% <100%> (ø) ⬆️
src/modules/wreck/wrexecd.c 75.08% <50%> (-0.3%) ⬇️
src/common/libkvs/kvs_txn.c 61% <61%> (ø)
src/common/libkvs/jansson_dirent.c 73.33% <73.33%> (ø)
src/common/libkvs/kvs_classic.c 84.54% <82.97%> (-3.34%) ⬇️
src/common/libkvs/kvs_watch.c 86.76% <84.45%> (-6.21%) ⬇️
src/common/libkvs/json_dirent.c 95.45% <0%> (-2.28%) ⬇️
src/common/libflux/handle.c 83.67% <0%> (-2.08%) ⬇️
... and 20 more

@garlick garlick force-pushed the kvs_api_cleanup2 branch from 339bc17 to 10d3961 Compare July 14, 2017 22:35
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 78.318% when pulling e605530 on garlick:kvs_api_cleanup2 into a3add43 on flux-framework:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 78.309% when pulling 10d3961 on garlick:kvs_api_cleanup2 into a3add43 on flux-framework:master.

@garlick garlick force-pushed the kvs_api_cleanup2 branch from 10d3961 to 9b389d7 Compare July 14, 2017 23:19
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 78.344% when pulling 9b389d7 on garlick:kvs_api_cleanup2 into a3add43 on flux-framework:master.

@garlick garlick force-pushed the kvs_api_cleanup2 branch from 9b389d7 to 1096b57 Compare July 17, 2017 17:15
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 78.338% when pulling 1096b57 on garlick:kvs_api_cleanup2 into a3add43 on flux-framework:master.

@chu11 chu11 mentioned this pull request Jul 18, 2017
@chu11
Copy link
Member

chu11 commented Jul 18, 2017

At a high level, things look good to me. I think it was a good idea to document the json "null" vs C NULL, which can be confusing.

@garlick
Copy link
Member Author

garlick commented Jul 18, 2017

Doing a bit of tidying up on this PR. Hopefully should be ready for another look by mid morning, then if no objection's I'll squash down the incremental changes and ask for merge consideration.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 78.299% when pulling ca21560 on garlick:kvs_api_cleanup2 into a3add43 on flux-framework:master.

@garlick
Copy link
Member Author

garlick commented Jul 18, 2017

OK, done with cleanup. This is ready for review IMHO.

garlick and others added 3 commits July 18, 2017 11:07
This test appears to be wrong - putting a null key is
equivalent to an unlink, but the test fails when that
actually happens.  The test may have been relying on
an earlier bug where the value of a key became the JSON
string "null" instead of the json value null.

Drop the test for now.
t1000-kvs-basic.t calls
  t/kvs/watch simulwatch <key> 8192

This is a stress test of sorts for the kvs_watch client
code.  Drop the count from 8192 to 256 to avoid a future problem
at exactly 500 (not 499) concurrent watches of the same key
where the test hangs.  The problem begins occurring after
the jansson rework of kvs_watch - modify the test here to
avoid breaking git bisect.

This may be an actual bug, but since kvs_watch() has significant
other changes pending, and many simulateous watches is not a real
use case anyway, leave it for another time.
Create jansson dirent, a duplicate of json_dirent but using jansson.

This set of functions is used for conversion of json-c to jansson
throughout flux.
@garlick garlick force-pushed the kvs_api_cleanup2 branch from 1914b4e to 33042b8 Compare July 18, 2017 18:28
@garlick
Copy link
Member Author

garlick commented Jul 18, 2017

rebased and added one last test.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.3%) to 78.333% when pulling 33042b8 on garlick:kvs_api_cleanup2 into 5776aeb on flux-framework:master.

@garlick
Copy link
Member Author

garlick commented Jul 18, 2017

Had one travis builder stall here:

PASS: t2000-wreck.t 5 - wreckrun: propagates current working directory
PASS: t2000-wreck.t 6 - wreckrun: propagates current environment
*** glibc detected *** lua: corrupted double-linked list: 0x0000000000dc2590 ***
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received
The build has been terminated

Hmm. Restarting.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.3%) to 78.303% when pulling 9e5416d on garlick:kvs_api_cleanup2 into 5776aeb on flux-framework:master.

@grondo
Copy link
Contributor

grondo commented Jul 18, 2017

Maybe it should be obvious, but perhaps the documentation for flux_kvs_txn_link could be improved.

flux_kvs_txn_link() sets key to be a hard link to a KVS tree object looked up with FLUX_KVS_TREEOBJ.

Perhaps two suggestions to make this clearer:

  • Replace the generic const char *json_str argument name in the prototype with const char *treeobj for symmetry with other calls that take treeobj JSON arguments.
  • Instead of referencing only the FLUX_KVS_TREEOBJ enum, would it be ok to reword the above like:

flux_kvs_txn_link() sets key to be a hard link to a KVS tree object obtained with flux_kvs_lookup() or flux_kvs_lookupat() using the FLUX_KVS_TREEOBJ flag.

(or similar)

Also, how does this function compare to the "classic" call flux_kvs_put_treeobj?

@garlick
Copy link
Member Author

garlick commented Jul 18, 2017

Those changes sound good, patch coming.

Also, how does this function compare to the "classic" call flux_kvs_put_treeobj?

Identical except that kvs_put_treeobj() operates on the default transaction rather than an explict one.

@grondo
Copy link
Contributor

grondo commented Jul 18, 2017

Identical except that kvs_put_treeobj() operates on the default transaction rather than an explict one.

Thanks. In a way the name of kvs_put_treeobj was more clear, so I was wondering about the change to simply "link" (sorry if this was already discussed and I forgot!). flux_kvs_txn_link() makes me think of the link(2) syscall so I was expecting the arguments to be similar -- e.g. link (const char *oldpath, const char *newpath), which might have added to the confusion.

@grondo
Copy link
Contributor

grondo commented Jul 18, 2017

The only other question I'll pose is whether the _txn_ in the function names was dropped purposefully for flux_kvs_commit and flux_kvs_commit. If there is any chance "implied transaction" versions of these functions would be needed in the future, we might want to reserve the non-txn names.

@grondo
Copy link
Contributor

grondo commented Jul 18, 2017

Other than those minor questions, I did test out this PR on some of our TOSS2 and 3 systems and no gotchas so I'm willing to merge for you once you're ready.

@garlick
Copy link
Member Author

garlick commented Jul 18, 2017

There wasn't any discussion that I recall. I sort of made that change on my own initiative, and I'm fine with changing it back. I find my thinking on interface design is often a bit warped when I've had my head in the implementation details.

I'll go ahead and change that, and squash the incremental changes. Thanks for the testing and review!

@garlick
Copy link
Member Author

garlick commented Jul 18, 2017

Actually, maybe for symmetry with flux_kvs_lookup(key, flags=FLUX_KVS_TREEOBJ), we should drop the extra function and use a FLUX_KVS_TREEOBJ flag with flux_kvs_txn_put()?

@grondo
Copy link
Contributor

grondo commented Jul 18, 2017

Actually, maybe for symmetry with flux_kvs_lookup(key, flags=FLUX_KVS_TREEOBJ), we should drop the extra function and use a FLUX_KVS_TREEOBJ flag with flux_kvs_txn_put()?

That works for me if it is OK with you!

@grondo
Copy link
Contributor

grondo commented Jul 19, 2017

Even with the above fix (which I think is valid, good catch), there are some other issues being caught by running the lua test t/lua/t1002-kvs.t under valgrind (by modification of fluxometer.lua). Possibly there are more places where the lua bindings should be taking an extra reference on returned objects, though I can't make sense of these errors yet:

==96418== Invalid read of size 8
==96418==    at 0x7436815: ??? (in /usr/lib64/libjansson.so.4.9.0)
==96418==    by 0x74368B8: ??? (in /usr/lib64/libjansson.so.4.9.0)
==96418==    by 0x743B7F8: json_delete (in /usr/lib64/libjansson.so.4.9.0)
==96418==    by 0x6FDBE7C: json_decref (jansson.h:112)
==96418==    by 0x6FDBE7C: flux_msg_destroy (message.c:263)
==96418==    by 0x6FD706D: handle_cb (dispatch.c:445)
==96418==    by 0x6FFBD3C: ev_invoke_pending (ev.c:3314)
==96418==    by 0x7000369: ev_run (ev.c:3717)
==96418==    by 0x6FD6D2E: flux_reactor_run (reactor.c:128)
==96418==    by 0x6D871A8: l_flux_reactor_start (flux-lua.c:1895)
==96418==    by 0x4E3E5A0: ??? (in /usr/lib64/liblua-5.1.so)
==96418==    by 0x4E49228: ??? (in /usr/lib64/liblua-5.1.so)
==96418==    by 0x4E3EA6C: ??? (in /usr/lib64/liblua-5.1.so)
==96418==  Address 0x6213618 is 8 bytes inside a block of size 32 free'd
==96418==    at 0x4C28C64: free (vg_replace_malloc.c:530)
==96418==    by 0x6FE4BEC: json_decref (jansson.h:112)
==96418==    by 0x6FE4BEC: watch_response_cb (kvs_watch.c:227)
==96418==    by 0x6FD6FC3: call_handler (dispatch.c:302)
==96418==    by 0x6FD719E: dispatch_message (dispatch.c:325)
==96418==    by 0x6FD719E: handle_cb (dispatch.c:391)
==96418==    by 0x6FFBD3C: ev_invoke_pending (ev.c:3314)
==96418==    by 0x7000369: ev_run (ev.c:3717)
==96418==    by 0x6FD6D2E: flux_reactor_run (reactor.c:128)
==96418==    by 0x6D871A8: l_flux_reactor_start (flux-lua.c:1895)
==96418==    by 0x4E3E5A0: ??? (in /usr/lib64/liblua-5.1.so)
==96418==    by 0x4E49228: ??? (in /usr/lib64/liblua-5.1.so)
==96418==    by 0x4E3EA6C: ??? (in /usr/lib64/liblua-5.1.so)
==96418==    by 0x4E3E106: ??? (in /usr/lib64/liblua-5.1.so)
==96418==  Block was alloc'd at
==96418==    at 0x4C2928A: malloc (vg_replace_malloc.c:299)
==96418==    by 0x743B0AD: ??? (in /usr/lib64/libjansson.so.4.9.0)
==96418==    by 0x743803A: ??? (in /usr/lib64/libjansson.so.4.9.0)
==96418==    by 0x7437E46: ??? (in /usr/lib64/libjansson.so.4.9.0)
==96418==    by 0x743820C: ??? (in /usr/lib64/libjansson.so.4.9.0)
==96418==    by 0x7438729: json_loads (in /usr/lib64/libjansson.so.4.9.0)
==96418==    by 0x6FDB731: flux_msg_vunpack (message.c:1171)
==96418==    by 0x6FDB7F7: flux_msg_unpack (message.c:1191)
==96418==    by 0x6FE4B45: watch_response_cb (kvs_watch.c:217)
==96418==    by 0x6FD6FC3: call_handler (dispatch.c:302)
==96418==    by 0x6FD719E: dispatch_message (dispatch.c:325)
==96418==    by 0x6FD719E: handle_cb (dispatch.c:391)
==96418==    by 0x6FFBD3C: ev_invoke_pending (ev.c:3314)

So sorry about this mess!

@grondo
Copy link
Contributor

grondo commented Jul 19, 2017

Hm, the above might not be related.. it looks like some possible jansson refcounting issue though.

@chu11
Copy link
Member

chu11 commented Jul 19, 2017

Was rebasing on your branch to prepare my PR #1108. Noticed you axed j_dirent_append() in d0125f3. I use this in my unit tests in PR #1108. AFAICT it'll be the last location it can be used, so I could just create a helper function inside the unit tests. But since I do use it when converting over to jansson, perhaps we can keep j_dirent_append() for now. Then I can refactor it out later on?

@garlick
Copy link
Member Author

garlick commented Jul 19, 2017

@grondo found the bug - tried to decref a json_t obtained with json_unpack ("s:o").

Just pushed the fix and dropped the proposed lua fix (opened #1112 for future consideration, in case it has merit on its own). I'll run the PMI test that was failing in a loop during our group meeting and if it doesn't fail in an hour, we're probably past that one. Whew! Thanks @grondo!

@garlick
Copy link
Member Author

garlick commented Jul 19, 2017

@chu11: ok I'll drop the commit that removes that function if it's more convenient to leave it in for now.

@garlick garlick force-pushed the kvs_api_cleanup2 branch from d1c16c4 to 8f77614 Compare July 19, 2017 20:56
garlick added 14 commits July 19, 2017 13:58
Rework the "write" side of the KVS API in terms of an
explicit transaction object as discussed in flux-framework#1094.
The interface is essentially
- create txn
- append ops to txn:  put, pack, mkdir, unlink, symlink
- commit/fence txn
- destroy txn

Also: rename the KVS_NOMERGE flag to FLUX_KVS_NOMERGE.

Provide a private interface so that unit tests can
examine internal contents of a commit.
Add unit test for kvs_txn.

Also: ensure that unit tests link against the static libkvs.la
rather than ../libflux.so so they pick up changes during
development, and the unit test hook in kvs_txn.c can be
kept private.
Drop the functions
  kvs_fence_set_context()
  kvs_fence_clear_context()

These were added so that a program could build up a fence
transaction while simultaneously performing other KVS commits.
That use case is now handled through explicit transactions.

Drop those functions and convert wrexecd and tests.
Reimplement kvs_put() and kvsdir_put() and associated
functions in terms of txn's.  Create a "default transaction"
that these functions add to, and have kvs_commit() and kvs_fence()
commit the default transaction.

Update tests and wrexecd.

Also: drop some unnecessary includes from kvs.c.
Replace calls to xstrdup() and xzmalloc() that exit on out
of memory error with strdup() and calloc(), and return ENOMEM
errors to the caller if they occur.

Drop some internal #includes that were there for no
apparent reason.
Convert kvs_watch() internals from json-c to jansson.

This conversion may have introduced a bug that causes
t/kvs/watch simulwatch with ntimes > 499 to hang.

For code simplification, one new thing here is that each
watch is assigned a future.  After the first response, the
future is fulfilled, but is kept around to prevent the matchtag
from being freed.  This simplified the code for performing the
initial RPC, but does increase the amount of state kept per
watch.  We really need a more well thought out way to handle
the "one request - multiple response" use case.

After the first response that fulfills the future, a generic
response message handler is used for subsequent responses.
A flux_kvs_txn_put (key, "null") should result in an EINVAL.
The t/kvs/basic put subcommand retries an EINVAL with the value
converted to a string.

Add a test that ensures the value is set to a string rather
than a JSON null, which is an invalid directory entry.
Avoid using proto.c::kp_twatch_enc(), and instead use
flux_mrpcf() to encode the mrpc request.  This prepares
for removal of proto.[ch] in a future commit.

Also, drop remaining json-c use in this test, decoding mrpc
stats response using flux_mrpc_getf().
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.3%) to 78.317% when pulling d1c16c4 on garlick:kvs_api_cleanup2 into 5776aeb on flux-framework:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.3%) to 78.297% when pulling 8f77614 on garlick:kvs_api_cleanup2 into 5776aeb on flux-framework:master.

@garlick
Copy link
Member Author

garlick commented Jul 19, 2017

I dropped the commit @chu11 wanted dropped, and verified the PMI test has been running in a loop for about 45m with no failures (where before it usually failed within a few minutes). I think this one's perhaps ready?

@grondo
Copy link
Contributor

grondo commented Jul 19, 2017

Yeah, looks good to me!

@grondo grondo merged commit 500e366 into flux-framework:master Jul 19, 2017
@garlick garlick deleted the kvs_api_cleanup2 branch July 19, 2017 21:49
@grondo grondo mentioned this pull request Aug 23, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants