content-sqlite, content-cache: cleanup and refactoring #2786

garlick · 2020-02-28T23:02:36Z

This is primarily a refactoring and cleanup of the content-sqlite module, with a bit of content-cache involvement, in preparation for changes to come for KVS checkpoint/restart.

Splitting a (hopefully) easily digestible chunk from #2783

garlick · 2020-02-28T23:37:48Z

Restarted ASAN builder that failed here

FAIL: t2402-job-exec-dummy.t 4 - job-exec: job shell failure recorded
    t2402-job-exec-dummy.t:  FAIL: N=10  PASS=9   FAIL=1 SKIP=0 XPASS=0 XFAIL=0
ERROR: t2402-job-exec-dummy.t - exited with status 1

grondo · 2020-02-28T23:52:16Z

FAIL: t2402-job-exec-dummy.t 4 - job-exec: job shell failure recorded

This might be the same issue @chu11 hit in #2777.

grondo

I ran on this branch for awhile without issue, and did a quick review of the code (as much as I was capable). It looks good to me.

garlick · 2020-03-02T19:16:03Z

Thanks! Not sure what went wrong with LGTM. It seemed to have stopped during checkout? I told it to rerun the C/C++ check and the output window is not doing anything. Hmm, maybe it will sort itself out over lunch.

garlick · 2020-03-02T21:16:52Z

Hmm, one builder failed a test in t2501-job-status.t 11

test_expect_code: command exited with 1, we wanted 143 flux job status -v 32396804096
not ok 11 - status: returns 143 (SIGTERM) for canceled running job

expecting success: 
	test_expect_code 130 flux job status -vv ${zero} ${one} ${sigint}

flux-job: fetching status for 3 jobs
flux-job: all done.
flux-job: 12415139840: exited with exit code 0
flux-job: 16693329920: exited with exit code 1
flux-job: 21072183296: exited with exit code 130

grondo · 2020-03-02T21:24:59Z

Hmm, one builder failed a test in t2501-job-status.t 11

Let me check that out to see if I've introduced a race there.

grondo · 2020-03-02T21:32:03Z

Let me check that out to see if I've introduced a race there.

That test synchronizes on the job.start event, but I wonder if it should synchronize on exec.shell.start instead to ensure that the shell/tasks are killed by SIGTERM.

grondo · 2020-03-02T21:50:31Z

I could not get the test to fail locally, but I wonder if this commit would fix it:

commit 6599e4e6a5fc08a90162555ccc6598f4c2e82497
Author: Mark A. Grondona <[email protected]>
Date:   Mon Mar 2 13:45:13 2020 -0800

    testsuite: fix racy test in t2501-job-status.t
    
    Problem: In t2501-job-status.t, a job is submitted and killed once
    the 'start' event has reached the job eventlog, but evidence from
    Travis-CI runs indicates even this may be racy, since the test failed
    reporting exit code 1 instead of 143 (SIGTERM).
    
    Instead of waiting for just the 'start' event, wait for the 'shell.start'
    event in the guest exec eventlog. This will ensure that all sleep
    processes have been started, which should more reliably report the
    expected exit status with `flux job status`.

diff --git a/t/t2501-job-status.t b/t/t2501-job-status.t
index baabd3f..3f2da68 100755
--- a/t/t2501-job-status.t
+++ b/t/t2501-job-status.t
@@ -13,8 +13,6 @@ test_expect_success 'status: submit a series of jobs' '
        shell_sigquit=$(flux mini submit sh -c "kill -QUIT \$PPID") &&
        unsatisfiable=$(flux mini submit -n 1024 hostname) &&
        killed=$(flux mini submit sleep 600) &&
-       flux job wait-event ${killed} start &&
-       flux job cancel ${killed} &&
        flux queue stop &&
        canceled=$(flux mini submit -n 1024 hostname) &&
        flux job cancel ${canceled} &&
@@ -49,6 +47,8 @@ test_expect_success 'status: --exception-exit-code works' '
        test_expect_code 255 flux job status -v --exception-exit-code=255 ${unsa
 '
 test_expect_success 'status: returns 143 (SIGTERM) for canceled running job' '
+       flux job wait-event -p guest.exec.eventlog ${killed} shell.start &&
+       flux job cancel ${killed} &&
        test_expect_code 143 flux job status -v ${killed}
 '
 test_expect_success 'status: returns highest status for multiple jobs' '

I can open a PR for this small fix, or feel free to push it directly on this PR, if it isn't too inapropos.

Some general cleanup of "context" create/destroy: - don't create a typedef where one is not needed - don't store "context" in the flux_t handle aux hash when create/destroy can occur explicitly in the module main - factor out database open/close operations to separate functions

Problem: module code can no longer pthread_cancel() a module thread, so protection of critical sections is no longer needed. Drop calls to pthread_setcancelstate().

Minor cleanup: - when functions calls must be split into multiple lines, break the line for each parameter - space between function name and parenthesized parameter list

Decouple sqlite pragma operations into separate blocks for readability and more detailed logging on failure.

Problem: errors are not checked when the database is closed. Catch any errors and log them.

Problem: content-sqlite needlessly performs a save/restore of errno around flux_future_destroy(), but that function does not modify errno. Drop the save/restore code.

Factor sqlite portion of load/store ooperations out of message handers to their own functions. This improves clarity and lets them be used for other purposes.

Problem: t0011-content-cache.t produces many lines consisting only of blobrefs on stdout. Redirect test output to /dev/null. It is not useful.

Drop comment about backing store reload that is no longer true.

Problem: message handler table is too scrunched up to accept longer handler names without reformatting. Spread out struct initialization to one field per line.

Problem: the content.backing RPC that both registers and unregisters a backing store is awkward to use and badly named. Split into two RPCs: content.register-backing and content.unregister-backing for clarity. Update content-sqlite and sharness test.

Problem: if the backing store is unloaded with dirty cache entries still present in the rank 0 cache, they can't be saved, and this could result in data loss. Log unflushables at LOG_ERR level so there is at least a log message if that happens.

Problem: flux-content(1) describes old behavior of content backing store module at unload time, where content was copied from the backing store to the memory cache. This is no longer done. Update the description to reflect the way it works now.

Problem: In t2501-job-status.t, a job is submitted and killed once the 'start' event has reached the job eventlog, but evidence from Travis-CI runs indicates even this may be racy, since the test failed reporting exit code 1 instead of 143 (SIGTERM). Instead of waiting for just the 'start' event, wait for the 'shell.start' event in the guest exec eventlog. This will ensure that all sleep processes have been started, which should more reliably report the expected exit status with `flux job status`.

garlick · 2020-03-02T22:04:55Z

Thanks, just forced a push with your commit tacked on.

codecov-io · 2020-03-02T22:39:51Z

Codecov Report

Merging #2786 into master will decrease coverage by 0.04%.
The diff coverage is 58.16%.

@@            Coverage Diff             @@
##           master    #2786      +/-   ##
==========================================
- Coverage   81.06%   81.02%   -0.05%     
==========================================
  Files         250      250              
  Lines       39428    39460      +32     
==========================================
+ Hits        31964    31971       +7     
- Misses       7464     7489      +25

Impacted Files	Coverage Δ
src/modules/content-sqlite/content-sqlite.c	`55% <56.39%> (-3.83%)`	⬇️
src/broker/content-cache.c	`65.22% <70.83%> (+0.54%)`	⬆️
src/modules/job-info/list.c	`66.49% <0%> (-2.07%)`	⬇️
src/modules/job-info/guest_watch.c	`75.57% <0%> (-0.58%)`	⬇️
src/modules/job-info/job_state.c	`70.06% <0%> (-0.47%)`	⬇️
src/modules/job-exec/job-exec.c	`74.72% <0%> (+0.22%)`	⬆️
src/modules/job-info/watch.c	`72.53% <0%> (+1.55%)`	⬆️

Problem: The 'sql_dump' prepared statement is unused, left over from code removed in flux-framework#2786. Remove unused prepared statement.

grondo approved these changes Mar 2, 2020

View reviewed changes

grondo added the merge-when-passing label Mar 2, 2020

grondo force-pushed the content_cleanup branch from 4847172 to ccd839a Compare March 2, 2020 18:54

garlick and others added 14 commits March 2, 2020 13:52

modules/content-sqlite: drop pthread_setcancelstate calls

122475a

Problem: module code can no longer pthread_cancel() a module thread, so protection of critical sections is no longer needed. Drop calls to pthread_setcancelstate().

modules/content-sqlite: [cleanup] adjust whitespace

0fb9544

Minor cleanup: - when functions calls must be split into multiple lines, break the line for each parameter - space between function name and parenthesized parameter list

modules/content-sqlite: error handling cleanup

8b93dd8

Decouple sqlite pragma operations into separate blocks for readability and more detailed logging on failure.

modules/content-sqlite: check errors during teardown

014b07a

Problem: errors are not checked when the database is closed. Catch any errors and log them.

modules/content-sqlite: [cleanup] drop errno save/restore

7e98db0

Problem: content-sqlite needlessly performs a save/restore of errno around flux_future_destroy(), but that function does not modify errno. Drop the save/restore code.

modules/content-sqlite: [cleanup] refactor load/store

d1cddc6

Factor sqlite portion of load/store ooperations out of message handers to their own functions. This improves clarity and lets them be used for other purposes.

testsuite: reduce output of content-cache test

69e66c5

Problem: t0011-content-cache.t produces many lines consisting only of blobrefs on stdout. Redirect test output to /dev/null. It is not useful.

broker/content-cache: drop outdated comment

9caec64

Drop comment about backing store reload that is no longer true.

broker/content-cache: [cleanup] respace handler table

334b276

Problem: message handler table is too scrunched up to accept longer handler names without reformatting. Spread out struct initialization to one field per line.

broker/content-cache: log unflushables

63a1791

Problem: if the backing store is unloaded with dirty cache entries still present in the rank 0 cache, they can't be saved, and this could result in data loss. Log unflushables at LOG_ERR level so there is at least a log message if that happens.

garlick force-pushed the content_cleanup branch from ccd839a to efcd9e8 Compare March 2, 2020 22:04

mergify bot merged commit 7d4fc7c into flux-framework:master Mar 2, 2020

garlick deleted the content_cleanup branch March 3, 2020 00:10

garlick mentioned this pull request Mar 4, 2020

add KVS checkpoints #2783

Merged

garlick added a commit to garlick/flux-core that referenced this pull request Mar 4, 2020

content-sqlite: drop unused dump prepared stmt

fd987cd

Problem: The 'sql_dump' prepared statement is unused, left over from code removed in flux-framework#2786. Remove unused prepared statement.

garlick added a commit to garlick/flux-core that referenced this pull request Mar 6, 2020

content-sqlite: drop unused dump prepared stmt

1ef979b

Problem: The 'sql_dump' prepared statement is unused, left over from code removed in flux-framework#2786. Remove unused prepared statement.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

content-sqlite, content-cache: cleanup and refactoring #2786

content-sqlite, content-cache: cleanup and refactoring #2786

garlick commented Feb 28, 2020

garlick commented Feb 28, 2020

grondo commented Feb 28, 2020

grondo left a comment

garlick commented Mar 2, 2020

garlick commented Mar 2, 2020

grondo commented Mar 2, 2020

grondo commented Mar 2, 2020

grondo commented Mar 2, 2020

garlick commented Mar 2, 2020

codecov-io commented Mar 2, 2020

content-sqlite, content-cache: cleanup and refactoring #2786

content-sqlite, content-cache: cleanup and refactoring #2786

Conversation

garlick commented Feb 28, 2020

garlick commented Feb 28, 2020

grondo commented Feb 28, 2020

grondo left a comment

Choose a reason for hiding this comment

garlick commented Mar 2, 2020

garlick commented Mar 2, 2020

grondo commented Mar 2, 2020

grondo commented Mar 2, 2020

grondo commented Mar 2, 2020

garlick commented Mar 2, 2020

codecov-io commented Mar 2, 2020

Codecov Report