work around fluxion inbability to recover running jobs #4894

garlick · 2023-01-28T00:18:37Z

Here's a workaround for the problem's we have with fluxion refusing to start when running jobs are in the KVS, described in #4862.

In this one, a failure of the scheduler's hello callback causes a fatal scheduler-restart job exception to be posted rather than a scheduler tear-down. When that particular exception is raised, the job manager clears the flag that indicates that resources are allocated, so the sched.free RPC is skipped when the job enters cleanup.

This is a WIP pending writing tests.

Problem: if scheduler cannot reallocate resources to a running job, the scheduler interface is torn down, requiring sys admin intervention. This was seen in conjunction with flux-framework/flux-sched#992. It's not really necessary for this to be fatal to the instance. Raise a fatal exception on the job and let it be cleaned up in the usual way.

Problem: after a scheduler-restart exception, the job manager still thinks the job is holding resources that it must free. Clear the has_resources flag on the job when this exception is posted so that it won't get stuck in CLEANUP and/or confuse the scheduler with a free request. The resources are effectively revoked.

Problem: t1008-recovery-none.t expects the job manager to abort the scheduler if a job fails to re-allocate resources during the hello handshake, but this behavior will change soon. Drop this test. The behavior it is looking for will either be addressed by a true fix to flux-framework#991 or the workaround proposed in flux-framework/flux-core#4894.

@grondo

Problem: there is no test coverage for module reload with running jobs and rv1_nosched. Add test proposed by @grondo in flux-framework#991, expecting failure for now. The test fails before and after the work-around proposed in flux-framework/flux-core#4894 because it checks for both: - qmanager reload fails (fails before the work-around) - job resources remain allocated (fails after the work-around) Increase the broker stderr log verbosity so the fatal job exceptions generated by the work-around at LOG_INFO level are visible when the test is run with -v.

garlick · 2023-01-28T17:36:34Z

If we can get flux-framework/flux-sched#1000 (ding ding ding! what did I win?) merged first, then this PR should start passing the sched CI test.

I'll try to add a test here that works with sched simple - something like

start a job
unload sched simple
exclude a node the job was using (at the resource level)
load sched simple
observe that the job received a fatal exception

Problem: sched-simple's "hello" callback doesn't return an error if it cannot re-allocate the job's resources. Return an error in this case so the job can be terminated with a fatal exception.

Problem: there is no test coverage for a failure of the sched hello callback. Add a test that ensures a job whose resources cannot be re-allocated receives a fatal exception.

codecov · 2023-01-29T05:45:14Z

Codecov Report

Merging #4894 (6ed2497) into master (1ff34c7) will increase coverage by 0.01%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master    #4894      +/-   ##
==========================================
+ Coverage   82.86%   82.87%   +0.01%     
==========================================
  Files         425      425              
  Lines       74814    74830      +16     
==========================================
+ Hits        61996    62019      +23     
+ Misses      12818    12811       -7

Impacted Files	Coverage Δ
src/common/libschedutil/hello.c	`70.00% <80.00%> (+4.14%)`	⬆️
src/modules/job-manager/event.c	`74.41% <90.00%> (+0.35%)`	⬆️
src/modules/sched-simple/sched.c	`78.16% <100.00%> (+0.20%)`	⬆️
src/modules/job-list/idsync.c	`66.91% <0.00%> (-2.95%)`	⬇️
src/modules/job-list/list.c	`75.13% <0.00%> (-2.12%)`	⬇️
src/common/libpmi/pmi.c	`91.74% <0.00%> (-1.84%)`	⬇️
src/common/libsubprocess/fork.c	`75.38% <0.00%> (-1.54%)`	⬇️
src/shell/info.c	`76.16% <0.00%> (-1.04%)`	⬇️
src/shell/shell.c	`82.22% <0.00%> (-0.13%)`	⬇️
... and 7 more

grondo

LGTM! Nice solution.

garlick · 2023-01-29T20:52:52Z

Thanks! I'll set MWP.

garlick added 2 commits January 27, 2023 15:53

garlick mentioned this pull request Jan 28, 2023

testsuite: adjust expectations of recovery in rv1_nosched mode flux-framework/flux-sched#1000

Merged

garlick added 2 commits January 28, 2023 21:01

sched-simple: return error on hello failure

9b45dd8

Problem: sched-simple's "hello" callback doesn't return an error if it cannot re-allocate the job's resources. Return an error in this case so the job can be terminated with a fatal exception.

testsuite: add test for sched hello failure

6ed2497

Problem: there is no test coverage for a failure of the sched hello callback. Add a test that ensures a job whose resources cannot be re-allocated receives a fatal exception.

garlick changed the title ~~WIP: work around fluxion inbability to recover running jobs~~ work around fluxion inbability to recover running jobs Jan 29, 2023

grondo approved these changes Jan 29, 2023

View reviewed changes

garlick added the merge-when-passing label Jan 29, 2023

mergify bot merged commit 03e9941 into flux-framework:master Jan 29, 2023

garlick mentioned this pull request Jan 29, 2023

crashing with running jobs can lead to problems #4862

Closed

garlick deleted the issue#4862 branch January 30, 2023 14:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

work around fluxion inbability to recover running jobs #4894

work around fluxion inbability to recover running jobs #4894

garlick commented Jan 28, 2023

garlick commented Jan 28, 2023

codecov bot commented Jan 29, 2023

grondo left a comment

garlick commented Jan 29, 2023

work around fluxion inbability to recover running jobs #4894

work around fluxion inbability to recover running jobs #4894

Conversation

garlick commented Jan 28, 2023

garlick commented Jan 28, 2023

codecov bot commented Jan 29, 2023

Codecov Report

grondo left a comment

Choose a reason for hiding this comment

garlick commented Jan 29, 2023