Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

work around fluxion inbability to recover running jobs #4894

Merged
merged 4 commits into from
Jan 29, 2023

Conversation

garlick
Copy link
Member

@garlick garlick commented Jan 28, 2023

Here's a workaround for the problem's we have with fluxion refusing to start when running jobs are in the KVS, described in #4862.

In this one, a failure of the scheduler's hello callback causes a fatal scheduler-restart job exception to be posted rather than a scheduler tear-down. When that particular exception is raised, the job manager clears the flag that indicates that resources are allocated, so the sched.free RPC is skipped when the job enters cleanup.

This is a WIP pending writing tests.

Problem: if scheduler cannot reallocate resources to a running job,
the scheduler interface is torn down, requiring sys admin intervention.

This was seen in conjunction with flux-framework/flux-sched#992.

It's not really necessary for this to be fatal to the instance.
Raise a fatal exception on the job and let it be cleaned up in the
usual way.
Problem: after a scheduler-restart exception, the job manager
still thinks the job is holding resources that it must free.

Clear the has_resources flag on the job when this exception is
posted so that it won't get stuck in CLEANUP and/or confuse the
scheduler with a free request.  The resources are effectively
revoked.
garlick added a commit to garlick/flux-sched that referenced this pull request Jan 28, 2023
Problem: t1008-recovery-none.t expects the job manager to abort the
scheduler if a job fails to re-allocate resources during the hello
handshake, but this behavior will change soon.

Drop this test.  The behavior it is looking for will either be
addressed by a true fix to flux-framework#991 or the workaround proposed in
flux-framework/flux-core#4894.
garlick added a commit to garlick/flux-sched that referenced this pull request Jan 28, 2023
Problem: there is no test coverage for module reload with running
jobs and rv1_nosched.

Add test proposed by @grondo in flux-framework#991, expecting failure for now.

The test fails before and after the work-around proposed in
flux-framework/flux-core#4894 because it checks for both:
- qmanager reload fails (fails before the work-around)
- job resources remain allocated (fails after the work-around)

Increase the broker stderr log verbosity so the fatal job exceptions
generated by the work-around at LOG_INFO level are visible when
the test is run with -v.
@garlick
Copy link
Member Author

garlick commented Jan 28, 2023

If we can get flux-framework/flux-sched#1000 (ding ding ding! what did I win?) merged first, then this PR should start passing the sched CI test.

I'll try to add a test here that works with sched simple - something like

  • start a job
  • unload sched simple
  • exclude a node the job was using (at the resource level)
  • load sched simple
  • observe that the job received a fatal exception

Problem: sched-simple's "hello" callback doesn't return an error
if it cannot re-allocate the job's resources.

Return an error in this case so the job can be terminated with
a fatal exception.
Problem: there is no test coverage for a failure of the
sched hello callback.

Add a test that ensures a job whose resources cannot be re-allocated
receives a fatal exception.
@garlick garlick changed the title WIP: work around fluxion inbability to recover running jobs work around fluxion inbability to recover running jobs Jan 29, 2023
@codecov
Copy link

codecov bot commented Jan 29, 2023

Codecov Report

Merging #4894 (6ed2497) into master (1ff34c7) will increase coverage by 0.01%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master    #4894      +/-   ##
==========================================
+ Coverage   82.86%   82.87%   +0.01%     
==========================================
  Files         425      425              
  Lines       74814    74830      +16     
==========================================
+ Hits        61996    62019      +23     
+ Misses      12818    12811       -7     
Impacted Files Coverage Δ
src/common/libschedutil/hello.c 70.00% <80.00%> (+4.14%) ⬆️
src/modules/job-manager/event.c 74.41% <90.00%> (+0.35%) ⬆️
src/modules/sched-simple/sched.c 78.16% <100.00%> (+0.20%) ⬆️
src/modules/job-list/idsync.c 66.91% <0.00%> (-2.95%) ⬇️
src/modules/job-list/list.c 75.13% <0.00%> (-2.12%) ⬇️
src/common/libpmi/pmi.c 91.74% <0.00%> (-1.84%) ⬇️
src/common/libsubprocess/fork.c 75.38% <0.00%> (-1.54%) ⬇️
src/shell/info.c 76.16% <0.00%> (-1.04%) ⬇️
src/shell/shell.c 82.22% <0.00%> (-0.13%) ⬇️
... and 7 more

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice solution.

@garlick
Copy link
Member Author

garlick commented Jan 29, 2023

Thanks! I'll set MWP.

@mergify mergify bot merged commit 03e9941 into flux-framework:master Jan 29, 2023
@garlick garlick deleted the issue#4862 branch January 30, 2023 14:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants