libschedutil: handle hello failure gracefully
Problem: if the scheduler cannot reallocate resources to a running job,
the scheduler interface is torn down, requiring sysadmin intervention.

This was seen in conjunction with flux-framework/flux-sched#992.

This does not need to be fatal to the instance. Instead, raise a fatal
exception on the affected job and let it be cleaned up in the usual way.
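
The change in behavior can be sketched outside of flux-core. The following is a minimal, self-contained illustration, not the actual libschedutil code: `struct job`, `hello_cb`, `hello_all_fatal`, and `hello_all_graceful` are hypothetical stand-ins, and the local `raise_exception` only sets a flag rather than calling `flux_job_raise`. It contrasts aborting the whole handshake on the first failed reallocation with marking only the affected job.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins; none of these types come from flux-core. */
struct job {
    uint64_t id;
    int realloc_ok;     /* 1 if the scheduler can take the job back */
    int exception;      /* set to 1 once a fatal exception is raised */
};

/* Stand-in for the scheduler's hello callback: returns -1 on failure. */
static int hello_cb (const struct job *job)
{
    return job->realloc_ok ? 0 : -1;
}

/* Stand-in for raising a fatal exception on a single job. */
static void raise_exception (struct job *job, const char *note)
{
    fprintf (stderr, "job %ju: %s\n", (uintmax_t)job->id, note);
    job->exception = 1;
}

/* Before: the first failure aborts the handshake for every job. */
static int hello_all_fatal (struct job *jobs, int count)
{
    assert (jobs != NULL);
    for (int i = 0; i < count; i++) {
        if (hello_cb (&jobs[i]) < 0)
            return -1;
    }
    return 0;
}

/* After: a failure marks only the affected job and the loop continues. */
static int hello_all_graceful (struct job *jobs, int count)
{
    assert (jobs != NULL);
    for (int i = 0; i < count; i++) {
        if (hello_cb (&jobs[i]) < 0)
            raise_exception (&jobs[i],
                             "failed to reallocate R for running job");
    }
    return 0;
}
```

With three jobs where only the second cannot be reallocated, `hello_all_fatal` returns -1 without touching any job, while `hello_all_graceful` returns 0 and flags only the second job.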
garlick committed Jan 27, 2023
1 parent 1ff34c7 commit 01d56a2
Showing 1 changed file with 21 additions and 1 deletion.
22 changes: 21 additions & 1 deletion src/common/libschedutil/hello.c
@@ -18,6 +18,26 @@
 #include "init.h"
 #include "hello.h"

+
+static void raise_exception (flux_t *h, flux_jobid_t id, const char *note)
+{
+    flux_future_t *f;
+
+    flux_log (h,
+              LOG_INFO,
+              "raising fatal exception on running job id=%ju",
+              (uintmax_t)id);
+
+    if (!(f = flux_job_raise (h, id, "scheduler-restart", 0, note))
+        || flux_future_get (f, NULL) < 0) {
+        flux_log_error (h,
+                        "error raising fatal exception on %ju: %s",
+                        (uintmax_t)id,
+                        future_strerror (f, errno));
+    }
+    flux_future_destroy (f);
+}
+
 static int schedutil_hello_job (schedutil_t *util,
                                 const flux_msg_t *msg)
 {
@@ -40,7 +60,7 @@ static int schedutil_hello_job (schedutil_t *util,
                           msg,
                           R,
                           util->cb_arg) < 0)
-        goto error;
+        raise_exception (util->h, id, "failed to reallocate R for running job");
     flux_future_destroy (f);
     return 0;
 error:
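
The helper added in this commit combines request creation and result retrieval in one condition, then destroys the future unconditionally. A minimal sketch of that idiom follows, using a hypothetical miniature `struct future` rather than the real `flux_future_t`. `request_start`, `future_get`, `future_destroy`, and `request_sync` are illustrative names, and the assumption that a destroy function accepts NULL is inferred from how the helper calls `flux_future_destroy (f)` on every path.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical miniature future; not the flux_future_t implementation. */
struct future {
    int errnum;     /* 0 on success, otherwise an errno value */
};

/* Start a request.  Returns NULL with errno set if it cannot be sent. */
static struct future *request_start (int can_send, int result_errnum)
{
    struct future *f;

    if (!can_send) {
        errno = ENOSYS;
        return NULL;
    }
    if (!(f = malloc (sizeof (*f))))
        return NULL;
    f->errnum = result_errnum;
    return f;
}

/* Fetch the result.  Returns -1 with errno set if the request failed. */
static int future_get (struct future *f)
{
    assert (f != NULL);
    if (f->errnum != 0) {
        errno = f->errnum;
        return -1;
    }
    return 0;
}

/* Destroy a future; NULL is accepted, as with free(). */
static void future_destroy (struct future *f)
{
    free (f);
}

/* Returns 0 on success, or the errno value observed on failure. */
static int request_sync (int can_send, int result_errnum)
{
    struct future *f;
    int rc = 0;

    /* '||' short-circuits, so future_get() never sees a NULL future. */
    if (!(f = request_start (can_send, result_errnum))
        || future_get (f) < 0)
        rc = errno;     /* capture errno before any cleanup call */
    future_destroy (f); /* reached on both the NULL and non-NULL paths */
    return rc;
}
```

Because both failure modes funnel into one branch, a single error report covers a request that could not be sent and one that was sent but failed, which is how the helper reports either case with one `flux_log_error` call.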
