Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

content-sqlite, content-cache: cleanup and refactoring #2786

Merged
merged 14 commits into from
Mar 2, 2020

Conversation

garlick
Copy link
Member

@garlick garlick commented Feb 28, 2020

This is primarily a refactoring and cleanup of the content-sqlite module, with a bit of content-cache involvement, in preparation for changes to come for KVS checkpoint/restart.

Splitting a (hopefully) easily digestible chunk from #2783

@garlick
Copy link
Member Author

garlick commented Feb 28, 2020

Restarted ASAN builder that failed here

FAIL: t2402-job-exec-dummy.t 4 - job-exec: job shell failure recorded
    t2402-job-exec-dummy.t:  FAIL: N=10  PASS=9   FAIL=1 SKIP=0 XPASS=0 XFAIL=0
ERROR: t2402-job-exec-dummy.t - exited with status 1

@grondo
Copy link
Contributor

grondo commented Feb 28, 2020

FAIL: t2402-job-exec-dummy.t 4 - job-exec: job shell failure recorded

This might be the same issue @chu11 hit in #2777.

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran on this branch for awhile without issue, and did a quick review of the code (as much as I was capable). It looks good to me.

@garlick
Copy link
Member Author

garlick commented Mar 2, 2020

Thanks! Not sure what went wrong with LGTM. It seemed to have stopped during checkout? I told it to rerun the C/C++ check and the output window is not doing anything. Hmm, maybe it will sort itself out over lunch.

@garlick
Copy link
Member Author

garlick commented Mar 2, 2020

Hmm, one builder failed a test in t2501-job-status.t 11

test_expect_code: command exited with 1, we wanted 143 flux job status -v 32396804096
not ok 11 - status: returns 143 (SIGTERM) for canceled running job

expecting success: 
	test_expect_code 130 flux job status -vv ${zero} ${one} ${sigint}

flux-job: fetching status for 3 jobs
flux-job: all done.
flux-job: 12415139840: exited with exit code 0
flux-job: 16693329920: exited with exit code 1
flux-job: 21072183296: exited with exit code 130

@grondo
Copy link
Contributor

grondo commented Mar 2, 2020

Hmm, one builder failed a test in t2501-job-status.t 11

Let me check that out to see if I've introduced a race there.

@grondo
Copy link
Contributor

grondo commented Mar 2, 2020

Let me check that out to see if I've introduced a race there.

That test synchronizes on the job.start event, but I wonder if it should synchronize on exec.shell.start instead to ensure that the shell/tasks are killed by SIGTERM.

@grondo
Copy link
Contributor

grondo commented Mar 2, 2020

I could not get the test to fail locally, but I wonder if this commit would fix it:

commit 6599e4e6a5fc08a90162555ccc6598f4c2e82497
Author: Mark A. Grondona <[email protected]>
Date:   Mon Mar 2 13:45:13 2020 -0800

    testsuite: fix racy test in t2501-job-status.t
    
    Problem: In t2501-job-status.t, a job is submitted and killed once
    the 'start' event has reached the job eventlog, but evidence from
    Travis-CI runs indicates even this may be racy, since the test failed
    reporting exit code 1 instead of 143 (SIGTERM).
    
    Instead of waiting for just the 'start' event, wait for the 'shell.start'
    event in the guest exec eventlog. This will ensure that all sleep
    processes have been started, which should more reliably report the
    expected exit status with `flux job status`.

diff --git a/t/t2501-job-status.t b/t/t2501-job-status.t
index baabd3f..3f2da68 100755
--- a/t/t2501-job-status.t
+++ b/t/t2501-job-status.t
@@ -13,8 +13,6 @@ test_expect_success 'status: submit a series of jobs' '
        shell_sigquit=$(flux mini submit sh -c "kill -QUIT \$PPID") &&
        unsatisfiable=$(flux mini submit -n 1024 hostname) &&
        killed=$(flux mini submit sleep 600) &&
-       flux job wait-event ${killed} start &&
-       flux job cancel ${killed} &&
        flux queue stop &&
        canceled=$(flux mini submit -n 1024 hostname) &&
        flux job cancel ${canceled} &&
@@ -49,6 +47,8 @@ test_expect_success 'status: --exception-exit-code works' '
        test_expect_code 255 flux job status -v --exception-exit-code=255 ${unsa
 '
 test_expect_success 'status: returns 143 (SIGTERM) for canceled running job' '
+       flux job wait-event -p guest.exec.eventlog ${killed} shell.start &&
+       flux job cancel ${killed} &&
        test_expect_code 143 flux job status -v ${killed}
 '
 test_expect_success 'status: returns highest status for multiple jobs' '

I can open a PR for this small fix, or feel free to push it directly on this PR, if it isn't too inapropos.

garlick and others added 14 commits March 2, 2020 13:52
Some general cleanup of "context" create/destroy:
- don't create a typedef where one is not needed
- don't store "context" in the flux_t handle aux hash
  when create/destroy can occur explicitly in the module main
- factor out database open/close operations to separate functions
Problem: module code can no longer pthread_cancel() a
module thread, so protection of critical sections is no
longer needed.

Drop calls to pthread_setcancelstate().
Minor cleanup:
- when functions calls must be split into multiple lines,
  break the line for each parameter
- space between function name and parenthesized parameter list
Decouple sqlite pragma operations into separate blocks
for readability and more detailed logging on failure.
Problem: errors are not checked when the database is
closed.

Catch any errors and log them.
Problem: content-sqlite needlessly performs a save/restore
of errno around flux_future_destroy(), but that function
does not modify errno.

Drop the save/restore code.
Factor sqlite portion of load/store ooperations out of message handers
to their own functions.  This improves clarity and lets them be
used for other purposes.
Problem: t0011-content-cache.t produces many lines
consisting only of blobrefs on stdout.

Redirect test output to /dev/null.  It is not useful.
Drop comment about backing store reload that is no longer true.
Problem: message handler table is too scrunched up to
accept longer handler names without reformatting.

Spread out struct initialization to one field per line.
Problem: the content.backing RPC that both registers and
unregisters a backing store is awkward to use and badly
named.

Split into two RPCs: content.register-backing and
content.unregister-backing for clarity.

Update content-sqlite and sharness test.
Problem: if the backing store is unloaded with dirty cache
entries still present in the rank 0 cache, they can't
be saved, and this could result in data loss.

Log unflushables at LOG_ERR level so there is at least
a log message if that happens.
Problem: flux-content(1) describes old behavior of
content backing store module at unload time, where content
was copied from the backing store to the memory cache.
This is no longer done.

Update the description to reflect the way it works now.
Problem: In t2501-job-status.t, a job is submitted and killed once
the 'start' event has reached the job eventlog, but evidence from
Travis-CI runs indicates even this may be racy, since the test failed
reporting exit code 1 instead of 143 (SIGTERM).

Instead of waiting for just the 'start' event, wait for the 'shell.start'
event in the guest exec eventlog. This will ensure that all sleep
processes have been started, which should more reliably report the
expected exit status with `flux job status`.
@garlick
Copy link
Member Author

garlick commented Mar 2, 2020

Thanks, just forced a push with your commit tacked on.

@codecov-io
Copy link

Codecov Report

Merging #2786 into master will decrease coverage by 0.04%.
The diff coverage is 58.16%.

@@            Coverage Diff             @@
##           master    #2786      +/-   ##
==========================================
- Coverage   81.06%   81.02%   -0.05%     
==========================================
  Files         250      250              
  Lines       39428    39460      +32     
==========================================
+ Hits        31964    31971       +7     
- Misses       7464     7489      +25
Impacted Files Coverage Δ
src/modules/content-sqlite/content-sqlite.c 55% <56.39%> (-3.83%) ⬇️
src/broker/content-cache.c 65.22% <70.83%> (+0.54%) ⬆️
src/modules/job-info/list.c 66.49% <0%> (-2.07%) ⬇️
src/modules/job-info/guest_watch.c 75.57% <0%> (-0.58%) ⬇️
src/modules/job-info/job_state.c 70.06% <0%> (-0.47%) ⬇️
src/modules/job-exec/job-exec.c 74.72% <0%> (+0.22%) ⬆️
src/modules/job-info/watch.c 72.53% <0%> (+1.55%) ⬆️

@mergify mergify bot merged commit 7d4fc7c into flux-framework:master Mar 2, 2020
@garlick garlick deleted the content_cleanup branch March 3, 2020 00:10
@garlick garlick mentioned this pull request Mar 4, 2020
garlick added a commit to garlick/flux-core that referenced this pull request Mar 4, 2020
Problem: The 'sql_dump' prepared statement is unused,
left over from code removed in flux-framework#2786.

Remove unused prepared statement.
garlick added a commit to garlick/flux-core that referenced this pull request Mar 6, 2020
Problem: The 'sql_dump' prepared statement is unused,
left over from code removed in flux-framework#2786.

Remove unused prepared statement.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants