-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
job-manager: improve robustness of max job id recovery on restart #4443
Conversation
src/modules/job-manager/restart.h
Outdated
@@ -20,6 +20,8 @@ int restart_from_kvs (struct job_manager *ctx); | |||
/* exposed for unit testing only */ | |||
int restart_count_char (const char *s, char c); | |||
|
|||
int checkpoint_to_txn (struct job_manager *ctx, flux_kvs_txn_t *txn); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it's just me, but this sounds like you're checkpointing into the txn? which makes no sense. Hmmm.
save/write/add_checkpoint_to_txn()
? I'd say add is the best of this bunch.
txn_checkpoint_max_jobid()
?
maybe there's a better one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I wasn't too happy with the function names either. OK, I tacked on an earlier commit that renames checkpoint_to_kvs()
to restart_save_state()
and then made the new function restart_save_state_to_txn()
. Hopefully that works a little better, and also normalizes all the public functions in restart.c to have a restart_*
prefix, and avoids the use of "checkpoint" which recently has become a bit of a loaded term.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! LGTM. Just a couple minor comments inline (one is more of a question)
@@ -219,8 +219,8 @@ int mod_main (flux_t *h, int argc, char **argv) | |||
flux_log_error (h, "flux_reactor_run"); | |||
goto done; | |||
} | |||
if (checkpoint_to_kvs (&ctx) < 0) { | |||
flux_log_error (h, "checkpoint_to_kvs"); | |||
if (restart_save_state (&ctx) < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commit message typo: "confusign"
@@ -197,6 +197,8 @@ static int restart_map_cb (struct job *job, void *arg, flux_error_t *error) | |||
(uintmax_t)job->id); | |||
return -1; | |||
} | |||
if (ctx->max_jobid < job->id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the previous rename of "checkpoint" to "restart", should/could the commit message here be updated too?
Thanks! I fixed those things and forced a push. I'll set MWP. |
Problem: the names of functions to save/restore state are confusing and use a prefix that doesn't match other public functions in the same source module. The term "checkpoint" is somewhat loaded now. Rename checkpoint_to_kvs() -> restart_save_state() checkpoint_restore() -> restart_restore_state() Update error messages to avoid the term "checkpoint" and eliminate a useless wrapper function.
Problem: the KVS checkpoint.job_manager key that contains the max job ID is only written out when the job manager is unloaded. If the unload fails, or if the broker crashes before then, an older value may be used and job IDs could be reissued. Track the max job id encountered during KVS job replay, and use the greater of the replayed job IDs or the saved max job ID value. Fixes flux-framework#4300
Problem: it is inconvenient to query the max_jobid from the job manager. Add max_jobid to the custom module stats object, so it can be queried with 'flux module stats' from test scripts.
Problem: the max job ID issued thus far is redundantly available from the job manager KVS checkpoint.job_manager key (written at module unload) and from scanning the KVS job history during restart, but purging can eliminate the latter option. Refactor function that synchronously saves state to expose a new function restart_save_state_to_txn(). Update checkpoint.job_manager whenever the job matching ctx->max_jobid is purged.
Problem: methods for ensuring the job manager starts from the newest-issued job ID have no test coverage. Add some tests to t22109-job-manager-restart.t.
Codecov Report
@@ Coverage Diff @@
## master #4443 +/- ##
==========================================
+ Coverage 79.94% 83.34% +3.40%
==========================================
Files 398 398
Lines 66699 67119 +420
==========================================
+ Hits 53322 55941 +2619
+ Misses 13377 11178 -2199
|
Problem: as noted in #4300, the job-manager only writes out the max job ID to the KVS when it is unloading, so despite best efforts to preserve KVS content (e.g. with defensive checkpoints added in #4383) when there is a crash or a failed shutdown, we are in danger of starting from an old max ID value and issuing duplicate job IDs.
This just uses the greater of the checkpoint value or the maximum job ID replayed from the KVS to improve robustness.
Also, since purging can eliminate all jobs in the KVS, we add code to update the checkpoint value in the KVS if the job matching the current max job ID happens to be purged, as suggested by @grondo.