work around fluxion inability to recover running jobs #4894
Problem: if the scheduler cannot re-allocate resources to a running job, the scheduler interface is torn down, requiring sysadmin intervention. This was seen in conjunction with flux-framework/flux-sched#992. This failure does not need to be fatal to the instance. Instead, raise a fatal exception on the job and let it be cleaned up in the usual way.
Problem: after a scheduler-restart exception, the job manager still thinks the job is holding resources that it must free. Clear the has_resources flag on the job when this exception is posted so that it won't get stuck in CLEANUP and/or confuse the scheduler with a free request. The resources are effectively revoked.
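The interaction between the exception and the has_resources flag can be sketched with a toy model. This is not flux-core code: the Job class and the post_exception and cleanup helpers are hypothetical names invented for illustration, and the only behavior taken from the description above is that a scheduler-restart exception clears has_resources so that no free request is sent during cleanup.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Toy stand-in for a job record (hypothetical, not the flux-core struct)."""
    id: int
    state: str = "RUN"
    has_resources: bool = True
    events: list = field(default_factory=list)


def post_exception(job, exc_type, severity=0):
    """Record an exception on the job; severity 0 is treated as fatal."""
    job.events.append(("exception", exc_type, severity))
    if exc_type == "scheduler-restart":
        # The resources are effectively revoked, so nothing is left to free.
        job.has_resources = False
    if severity == 0:
        job.state = "CLEANUP"


def cleanup(job, send_free):
    """Finish cleanup, sending a free request only if resources are still held."""
    if job.has_resources:
        send_free(job.id)
        job.has_resources = False
    job.state = "INACTIVE"


# A job hit by scheduler-restart skips the free request; any other fatal
# exception still frees its resources.
freed = []
revoked = Job(id=1)
post_exception(revoked, "scheduler-restart")
cleanup(revoked, freed.append)

cancelled = Job(id=2)
post_exception(cancelled, "cancel")
cleanup(cancelled, freed.append)
```

In this sketch only job 2 reaches send_free, which mirrors the goal of not confusing the scheduler with a free request for resources it never re-allocated.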
Problem: t1008-recovery-none.t expects the job manager to abort the scheduler if a job fails to re-allocate resources during the hello handshake, but this behavior will change soon. Drop this test. The behavior it is looking for will either be addressed by a true fix to flux-framework#991 or the workaround proposed in flux-framework/flux-core#4894.
Problem: there is no test coverage for module reload with running jobs and rv1_nosched. Add test proposed by @grondo in flux-framework#991, expecting failure for now. The test fails before and after the work-around proposed in flux-framework/flux-core#4894 because it checks for both:
- qmanager reload fails (fails before the work-around)
- job resources remain allocated (fails after the work-around)
Increase the broker stderr log verbosity so the fatal job exceptions generated by the work-around at LOG_INFO level are visible when the test is run with -v.
If we can get flux-framework/flux-sched#1000 (ding ding ding! what did I win?) merged first, then this PR should start passing the sched CI test. I'll try to add a test here that works with sched simple - something like
Problem: sched-simple's "hello" callback doesn't return an error if it cannot re-allocate the job's resources. Return an error in this case so the job can be terminated with a fatal exception.
Problem: there is no test coverage for a failure of the sched hello callback. Add a test that ensures a job whose resources cannot be re-allocated receives a fatal exception.
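The scheduler side of the same idea can be modeled in a few lines. This is a hedged sketch, not sched-simple's implementation: hello and replay_running_jobs are hypothetical names, and resources are reduced to a set of node ranks purely for illustration. The point it shows is that the callback reports failure when a job's resources cannot be re-claimed, so the caller can single out that job for a fatal exception instead of aborting.

```python
def hello(free_nodes, job_nodes):
    """Try to re-claim job_nodes from free_nodes; return True on success."""
    if not job_nodes <= free_nodes:
        # Report the failure instead of silently continuing.
        return False
    free_nodes.difference_update(job_nodes)
    return True


def replay_running_jobs(free_nodes, running):
    """Return the ids of jobs whose resources could not be re-allocated."""
    failed = []
    for jobid, nodes in running.items():
        if not hello(free_nodes, nodes):
            failed.append(jobid)
    return failed


# Job 102 wants node 1, which job 101 has already re-claimed.
free = {0, 1, 2}
running = {101: {0, 1}, 102: {1, 2}}
failed = replay_running_jobs(free, running)
```

Here failed contains only job 102, the job that would receive the fatal exception, while job 101 keeps its nodes and the replay continues.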
Codecov Report
@@            Coverage Diff             @@
##           master    #4894     +/-   ##
==========================================
+ Coverage   82.86%   82.87%   +0.01%
==========================================
  Files         425      425
  Lines       74814    74830     +16
==========================================
+ Hits        61996    62019     +23
+ Misses      12818    12811      -7
LGTM! Nice solution.
Thanks! I'll set MWP.
Here's a workaround for the problems we have with fluxion refusing to start when running jobs are in the KVS, described in #4862.
In this one, a failure of the scheduler's hello callback causes a fatal scheduler-restart job exception to be posted rather than a scheduler tear-down. When that particular exception is raised, the job manager clears the flag that indicates that resources are allocated, so the sched.free RPC is skipped when the job enters cleanup.
This is a WIP pending writing tests.