Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KVS oom() and xzmalloc() cleanup #1124

Merged
merged 22 commits into from
Jul 25, 2017

Conversation

chu11
Copy link
Member

@chu11 chu11 commented Jul 25, 2017

Like other refactorings that have been going on in flux-core, this one removes a large number of oom() calls and functions that call oom() (notably xzmalloc() and xstrdup()). Instead
ENOMEM is detected, properly return as an error, check for error, log error, etc.

A few oom()s remain (wait_runqueue() and the merging ops functions). Those will require a tad more re-architecture and logic updates so I will put those in another PR. The changes in this PR are relatively more grunted out fixes without much thought except for cleanup path handling.

I suspect the coverage on this diff will be bad. I'd be surprised if it was even greater than 50% coverage, as almost the entirety of it is handling ENOMEM errors and the cascading checks for functions that can now return an ENOMEM error.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.5%) to 77.991% when pulling 2341204 on chu11:kvs_oom_xzmalloc_cleanup-try4 into ccf25dc on flux-framework:master.

@codecov-io
Copy link

codecov-io commented Jul 25, 2017

Codecov Report

Merging #1124 into master will decrease coverage by 0.49%.
The diff coverage is 41.27%.

@@            Coverage Diff            @@
##           master    #1124     +/-   ##
=========================================
- Coverage   78.07%   77.57%   -0.5%     
=========================================
  Files         184      184             
  Lines       31430    31668    +238     
=========================================
+ Hits        24538    24567     +29     
- Misses       6892     7101    +209
Impacted Files Coverage Δ
src/common/libkvs/jansson_dirent.c 64.44% <25%> (-16.12%) ⬇️
src/modules/kvs/fence.c 77.77% <28.57%> (-4.58%) ⬇️
src/modules/kvs/lookup.c 83.68% <34.88%> (-7.69%) ⬇️
src/modules/kvs/kvs.c 69.12% <36.66%> (-6.24%) ⬇️
src/modules/kvs/commit.c 73.08% <41.41%> (-8.74%) ⬇️
src/modules/kvs/waitqueue.c 81.13% <45%> (-5.33%) ⬇️
src/modules/kvs/cache.c 91.3% <62.85%> (-1.94%) ⬇️
src/modules/kvs/kvs_util.c 73.91% <64.7%> (-4.35%) ⬇️
t/mpi/hello.c 93.54% <0%> (-6.46%) ⬇️
src/common/libkvs/kvs.c 70.8% <0%> (-6.1%) ⬇️
... and 16 more

@chu11
Copy link
Member Author

chu11 commented Jul 25, 2017

hmmm. the 40.27% dff target isn't too surprising. But the project coverage going down 0.53% seems rather high. But then again, that's only 156 lines of code, which doesn't seem too alarming given all of the ENOMEM paths. I'll see if I can tweak anything and get the coverage better.

@grondo
Copy link
Contributor

grondo commented Jul 25, 2017

Yeah, the unchecked paths mostly seem like error paths, and it is very difficult to get coverage there. However, it would seem like some of these error paths would be really good to check, and for example cleanup_dirty_cache_entry is not called during testing. Could you think of any way to trigger an error in any these paths for testing purposes, and ensure the cleanup works as expected?

@grondo
Copy link
Contributor

grondo commented Jul 25, 2017

I just peeked at this real quick to check on the coverage report, but I really appreciate some of the nice code comments you added in tricky sections. Really helpful for someone browsing the code. Thanks!

@garlick
Copy link
Member

garlick commented Jul 25, 2017

Nice work @chu11 as usual!

I haven't looked in detail at how it might apply here but we do have the debug hooks for modules that are set with flux module debug. E.g. see some of the tests in t0017-security.t that tweak the connector-local and userdb modules. Possibly some test bits for the kvs module could be used to trigger some of the hairier error paths.

@chu11
Copy link
Member Author

chu11 commented Jul 25, 2017

@grondo Yeah, the cleanup_dirty_cache_entry in kvs.c did stick out a bit. I think if I somehow make a common function along with cleanup_dirty_cache_entry in commit.c then I'll atleast have confidence that the cleanup would work (it is covered via the commit.c specific unit tests).

@chu11
Copy link
Member Author

chu11 commented Jul 25, 2017

pushed a refactor creating a common function commit_cleanup_dirty_cache_entry, and added a unit test for a general missed case. May not matter much, but hopefully the net change is a bit less.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.4%) to 78.014% when pulling 146b62e on chu11:kvs_oom_xzmalloc_cleanup-try4 into ccf25dc on flux-framework:master.

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a few minor comments but on the whole this is a really positive set of changes.

goto done;
}
if (!blobref || blobref[blobref_size - 1] != '\0') {
errno = EPROTO;
flux_log_error (ctx->h, "%s", __FUNCTION__);
flux_log_error (ctx->h, "%s: EPROTO", __FUNCTION__);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to print "content_store_get: EPROTO: Protocol Error". Maybe change "EPROTO" to "invalid blobref"?

oom ();
if (!(s = json_dumps (tmp, flags)))
oom ();
if (!(tmp = json_null ())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe just replace json_null() and json_dumps() calls with strdup ("null")?

bool rc = strlen (s) > BLOBREF_MAX_STRING_SIZE;
free (s);
return rc;
return rc ? 1 : 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just make rc an int and set it before the free, then return it directly?

bool rc = strlen (s) > BLOBREF_MAX_STRING_SIZE;
free (s);
return rc;
return rc ? 1 : 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: make new utility function that calls kvs_util_json_dumps() internally:

// returns 0 on success, -1 on failure
int kvs_util_json_encoded_size (json_t *o, size_t *size);

then below, use it to set a new variable size_t value_encoded_size, and test it for > BLOBREF_MAX_STRING_SIZE right there in commit_unroll().

Not a big deal, but might be a little clearer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh, I see what you mean. Yeah, that'll probably clarify things a bit. But more importantly it'll use kvs_util_json_dumps(), which I had missed earlier.

/* Return true if load successful, false if stalling */
static bool load (kvs_ctx_t *ctx, const href_t ref, wait_t *wait, json_t **op)
/* Return 1 if load successful, 0 if stalling, -1 on error */
static int load (kvs_ctx_t *ctx, const href_t ref, wait_t *wait, json_t **op)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be clearer to add a stalled parameter to the function and just return 0 on success, -1 on failure?

@@ -253,10 +254,13 @@ static int commit_unroll (commit_t *c, int current_epoch, json_t *dir)
goto done;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment block describing return value for commit_unroll() needs an update (still refers to boolean return).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, this is all the way back from my commit API refactoring. Will add a cleanup.

@@ -325,9 +332,11 @@ static int commit_link_dirent (commit_t *c, int current_epoch,
errno = ENOMEM;
goto done;
}
if (json_object_set_new (dir,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is workable, but would it be easier to just dispense with j_dirent_create() and call json_pack() in all the places where it's used? It doesn't really do very much, and it might be easier to keep track of refcounts and error behavior if we were more direct. Ignore if that's out of scope here.

@chu11
Copy link
Member Author

chu11 commented Jul 25, 2017

Just pushed fixes for all the tiny things @garlick found. If looks good, can squash and then rebase on master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.4%) to 78.042% when pulling 3df439d on chu11:kvs_oom_xzmalloc_cleanup-try4 into ccf25dc on flux-framework:master.

@garlick
Copy link
Member

garlick commented Jul 25, 2017

Those changes look good!

@chu11 chu11 force-pushed the kvs_oom_xzmalloc_cleanup-try4 branch from 3df439d to 2df537e Compare July 25, 2017 18:14
@chu11
Copy link
Member Author

chu11 commented Jul 25, 2017

just pushed a rebased + squashed tree. Didn't squash every cleanup patch, as some of them seemed more like fixes/cleanup from the prior PRs, so they should stay separate.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.4%) to 78.031% when pulling 2df537e on chu11:kvs_oom_xzmalloc_cleanup-try4 into ccf25dc on flux-framework:master.

@chu11
Copy link
Member Author

chu11 commented Jul 25, 2017

hmmm, getting valgrind errors. But I only squashed and rebased compared to the last push. Wonder what I did.

@grondo
Copy link
Contributor

grondo commented Jul 25, 2017

Hm, that is a bit scary. I hope it is not a strange false positive.

@chu11
Copy link
Member Author

chu11 commented Jul 25, 2017

I think I may have just read the travis results wrong. Normally "coverage/coveralls" is the first one listed and "continuous-integration/travis-ci" is second. I saw the red x and green checkmark and assumed travis had passed and coverage was bad. Think I found the mem-leak I introduced, and perhaps another minor cleanup I can fix.

@chu11 chu11 force-pushed the kvs_oom_xzmalloc_cleanup-try4 branch from 2df537e to cb64774 Compare July 25, 2017 20:47
@chu11
Copy link
Member Author

chu11 commented Jul 25, 2017

re-pushed, i forgot to free memory in the new kvs_util_json_encoded_size() function. That's the only change.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.4%) to 78.036% when pulling cb64774 on chu11:kvs_oom_xzmalloc_cleanup-try4 into ccf25dc on flux-framework:master.

chu11 added 7 commits July 25, 2017 15:24
Add flux log error messages in places there should be log messages.

Fix up flux log error message formats for consistency.
Move log function above asserts, so log message will definitely
occur.
Begin refactoring removal of oom () calls.  Begin by removing oom ()
calls where it is easy to simply replace with error responses.  i.e.
functions that already return errors and callers check for errors, so
slipping in a new return of ENOMEM is relatively simple.

Remove include of oom.h where appropriate.
Begin refactoring away functions that will oom() and exit on
ENOMEM errors.  Replace calls of xstrdup(), xzmalloc(), and
xasprintf() with strdup(), calloc(), and asprintf() respectively
and return ENOMEM to callers appropriately.  Begin by replacing
where it is easy to simply replace with error responses.  i.e
functions that already return errors and callers check for errors,
so slipping in a new return of ENOMEM is relatively simple.

Remove header include of xzmalloc.h where appropriate.
kvs_util_json_dumps() and kvs_util_json_hash() can now return ENOMEM
as an error.

Adjust callers of these functions appropriately to handle error.
Have cache_get_stats() and cache_expire_entries() return -1 on error.

Adjust callers appropriately to handle potential error.
chu11 added 15 commits July 25, 2017 15:24
When putting a dirty cache entry in the item callback list, do
not oom(), instead cleanup the dirty cache entry and return
ENOMEM appropriately.
Have fileval_big() return -1 on error, 0 on false, 1 on true.

Adjust callers appropriately to handle potential error.
Refactor load() function so it can return an error instead of only
returning true/false on load/stall.
Have wait_queue_create() return NULL on ENOMEM.

The potential for error in wait_queue_create() cascades to
cache_entry_wait_notdirty() and cache_entry_wait_valid(), which now
can return errors.  Update all callers of cache_entry_wait_notdirty()
and cache_entry_wait_valid() to handle potential errors.

Refactor cleanup_dirty_cache_entry() in kvs.c and commit.c into
commit_cleanup_dirty_cache_entry() common function.

Update unit tests for cache_entry_wait_notdirty() and cache_entry_wait_valid().
Add unit test for commit_cleanup_dirty_cache_entry().
Do not oom() on errors in j_dirent_create(), return ENOMEM appropriately.

Fix all callers of j_dirent_create() to check for error and handle
error appropriately.
With refactor / change of j_dirent_create(), now remove oom() call
from json_object_set_new() calls and return error appropriately.
Call calloc() over xzmalloc() in cache_entry_create() and return
ENOMEM appropriately.  Update all callers to handle error appropriately.
Replace kvs_util_json_copydir() with json_copy() and check for
ENOMEM errors appropriately.

Remove unnecessary kvs_util.h includes.

Update unit tests appropriately.
Return -1 on error from wait_addqueue() and 0 on success.

Update all callers to check for error appropriately.

Update unit tests appropriately.
Handle errors appropriately and update callers to handle errors.
Add missing unit test for removing object from cache entry
Instead of using two jansson calls, just strdup the string "null".
Add new kvs_util_json_encoded_size() function determine the
size of an object consistent to the format used by kvs_util_json_dumps().

Use this function instead of fileval_big() in commit API.

Add unit tests appropriately.
@chu11 chu11 force-pushed the kvs_oom_xzmalloc_cleanup-try4 branch from cb64774 to e464830 Compare July 25, 2017 22:24
@chu11
Copy link
Member Author

chu11 commented Jul 25, 2017

rebased on master and re-pushed

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.4%) to 78.018% when pulling e464830 on chu11:kvs_oom_xzmalloc_cleanup-try4 into a132370 on flux-framework:master.

@garlick
Copy link
Member

garlick commented Jul 25, 2017

Finally through travis! Ready @chu11?

@chu11
Copy link
Member Author

chu11 commented Jul 25, 2017

yup!

@garlick garlick merged commit 5f48548 into flux-framework:master Jul 25, 2017
@grondo grondo mentioned this pull request Aug 23, 2017
@chu11 chu11 deleted the kvs_oom_xzmalloc_cleanup-try4 branch June 5, 2021 18:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants