content-sqlite, content-cache: cleanup and refactoring #2786
Restarted ASAN builder that failed here
I ran on this branch for a while without issue, and did a quick review of the code (as much as I was capable). It looks good to me.
Thanks! Not sure what went wrong with LGTM. It seemed to have stopped during checkout? I told it to rerun the C/C++ check and the output window is not doing anything. Hmm, maybe it will sort itself out over lunch.
Hmm, one builder failed a test in
Let me check that out to see if I've introduced a race there.
That test synchronizes on the |
I could not get the test to fail locally, but I wonder if this commit would fix it:
commit 6599e4e6a5fc08a90162555ccc6598f4c2e82497
Author: Mark A. Grondona <[email protected]>
Date: Mon Mar 2 13:45:13 2020 -0800
testsuite: fix racy test in t2501-job-status.t
Problem: In t2501-job-status.t, a job is submitted and killed once
the 'start' event has reached the job eventlog, but evidence from
Travis-CI runs indicates even this may be racy, since the test failed
reporting exit code 1 instead of 143 (SIGTERM).
Instead of waiting for just the 'start' event, wait for the 'shell.start'
event in the guest exec eventlog. This will ensure that all sleep
processes have been started, which should more reliably report the
expected exit status with `flux job status`.
diff --git a/t/t2501-job-status.t b/t/t2501-job-status.t
index baabd3f..3f2da68 100755
--- a/t/t2501-job-status.t
+++ b/t/t2501-job-status.t
@@ -13,8 +13,6 @@ test_expect_success 'status: submit a series of jobs' '
shell_sigquit=$(flux mini submit sh -c "kill -QUIT \$PPID") &&
unsatisfiable=$(flux mini submit -n 1024 hostname) &&
killed=$(flux mini submit sleep 600) &&
- flux job wait-event ${killed} start &&
- flux job cancel ${killed} &&
flux queue stop &&
canceled=$(flux mini submit -n 1024 hostname) &&
flux job cancel ${canceled} &&
@@ -49,6 +47,8 @@ test_expect_success 'status: --exception-exit-code works' '
test_expect_code 255 flux job status -v --exception-exit-code=255 ${unsa
'
test_expect_success 'status: returns 143 (SIGTERM) for canceled running job' '
+ flux job wait-event -p guest.exec.eventlog ${killed} shell.start &&
+ flux job cancel ${killed} &&
test_expect_code 143 flux job status -v ${killed}
'
test_expect_success 'status: returns highest status for multiple jobs' '
I can open a PR for this small fix, or feel free to push it directly on this PR, if that isn't too out of place.
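For anyone wondering where 143 comes from: a shell reports 128 plus the signal number for a child that died from a signal, and SIGTERM is 15. Below is a minimal plain-sh sketch of that convention. It does not use flux at all; `term_status` is a hypothetical helper name made up for illustration, and the sketch assumes a system where SIGTERM is signal 15.

```shell
#!/bin/sh
# Hypothetical helper (not part of flux): start a long sleep in the
# background, kill it with SIGTERM, and print the status that `wait`
# reports for the terminated child.
term_status() {
    sleep 600 >/dev/null 2>&1 &
    pid=$!
    kill -TERM "$pid"
    # For a child killed by a signal, the reported status is
    # 128 + the signal number.
    wait "$pid"
    echo $?
}

term_status
```

With SIGTERM being 15, the value printed is 128 + 15 = 143, which is the status the test above expects from `flux job status` for a job whose tasks were running when it was canceled.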
Some general cleanup of "context" create/destroy:
- don't create a typedef where one is not needed
- don't store "context" in the flux_t handle aux hash when create/destroy can occur explicitly in the module main
- factor out database open/close operations to separate functions
Problem: module code can no longer pthread_cancel() a module thread, so protection of critical sections is no longer needed. Drop calls to pthread_setcancelstate().
Minor cleanup:
- when function calls must be split into multiple lines, break the line for each parameter
- space between function name and parenthesized parameter list
Decouple sqlite pragma operations into separate blocks for readability and more detailed logging on failure.
Problem: errors are not checked when the database is closed. Catch any errors and log them.
Problem: content-sqlite needlessly performs a save/restore of errno around flux_future_destroy(), but that function does not modify errno. Drop the save/restore code.
Factor sqlite portion of load/store operations out of message handlers to their own functions. This improves clarity and lets them be used for other purposes.
Problem: t0011-content-cache.t produces many lines consisting only of blobrefs on stdout. Redirect test output to /dev/null. It is not useful.
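A rough sketch of the idea, in plain sh rather than the actual sharness test: `noisy_store` is a hypothetical stand-in for a test step that prints one blobref-like line per stored blob, and the line format is made up for illustration. Redirecting stdout to /dev/null silences those lines while the command's exit status is still what the test checks.

```shell
#!/bin/sh
# Hypothetical stand-in for a noisy test step: prints three made-up
# blobref-like lines on stdout and returns success.
noisy_store() {
    for i in 1 2 3; do
        echo "sha1-$(printf '%040d' "$i")"
    done
}

# Discard stdout only; stderr and the exit status are unaffected.
noisy_store >/dev/null
echo "exit status: $?"
```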
Drop comment about backing store reload that is no longer true.
Problem: message handler table is too scrunched up to accept longer handler names without reformatting. Spread out struct initialization to one field per line.
Problem: the content.backing RPC that both registers and unregisters a backing store is awkward to use and badly named. Split into two RPCs: content.register-backing and content.unregister-backing for clarity. Update content-sqlite and sharness test.
Problem: if the backing store is unloaded with dirty cache entries still present in the rank 0 cache, they can't be saved, and this could result in data loss. Log unflushables at LOG_ERR level so there is at least a log message if that happens.
Problem: flux-content(1) describes old behavior of content backing store module at unload time, where content was copied from the backing store to the memory cache. This is no longer done. Update the description to reflect the way it works now.
Problem: In t2501-job-status.t, a job is submitted and killed once the 'start' event has reached the job eventlog, but evidence from Travis-CI runs indicates even this may be racy, since the test failed reporting exit code 1 instead of 143 (SIGTERM). Instead of waiting for just the 'start' event, wait for the 'shell.start' event in the guest exec eventlog. This will ensure that all sleep processes have been started, which should more reliably report the expected exit status with `flux job status`.
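The general pattern in this fix is to block until a specific, later event has been observed before sending the signal. Here is a plain-sh sketch of that pattern, assuming a log file that a job appends event names to. `wait_for_line` is a hypothetical helper invented for illustration and is not how `flux job wait-event` is implemented; the fractional `sleep 0.1` is also an assumption, since POSIX only guarantees whole seconds.

```shell
#!/bin/sh
# Hypothetical helper: poll FILE until a line matching PATTERN appears,
# giving up after TRIES attempts (default 50) spaced 0.1 seconds apart.
# Returns 0 if the pattern was seen, 1 on timeout.
wait_for_line() {
    file=$1
    pattern=$2
    tries=${3:-50}
    while [ "$tries" -gt 0 ]; do
        if grep -q -- "$pattern" "$file" 2>/dev/null; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 0.1
    done
    return 1
}
```

A caller would run something like `wait_for_line "$log" shell.start && kill -TERM "$pid"`, so the signal is only sent once the marker has been written. Waiting on a later marker narrows the window in which the signal can arrive before the processes of interest exist.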
Thanks, just forced a push with your commit tacked on.
Codecov Report
@@ Coverage Diff @@
## master #2786 +/- ##
==========================================
- Coverage 81.06% 81.02% -0.05%
==========================================
Files 250 250
Lines 39428 39460 +32
==========================================
+ Hits 31964 31971 +7
- Misses 7464 7489 +25
Problem: The 'sql_dump' prepared statement is unused, left over from code removed in flux-framework#2786. Remove unused prepared statement.
This is primarily a refactoring and cleanup of the content-sqlite module, with a bit of content-cache involvement, in preparation for changes to come for KVS checkpoint/restart.
Splitting a (hopefully) easily digestible chunk from #2783