job-manager: improve robustness of max job id recovery on restart #4443

garlick · 2022-07-27T22:19:59Z

Problem: as noted in #4300, the job-manager only writes out the max job ID to the KVS when it is unloading, so despite best efforts to preserve KVS content (e.g. with defensive checkpoints added in #4383) when there is a crash or a failed shutdown, we are in danger of starting from an old max ID value and issuing duplicate job IDs.

This just uses the greater of the checkpoint value or the maximum job ID replayed from the KVS to improve robustness.

Also, since purging can eliminate all jobs in the KVS, we add code to update the checkpoint value in the KVS if the job matching the current max job ID happens to be purged, as suggested by @grondo.

chu11 · 2022-07-28T04:20:53Z

src/modules/job-manager/restart.h

@@ -20,6 +20,8 @@ int restart_from_kvs (struct job_manager *ctx);
 /* exposed for unit testing only */
 int restart_count_char (const char *s, char c);

+int checkpoint_to_txn (struct job_manager *ctx, flux_kvs_txn_t *txn);


maybe it's just me, but this sounds like you're checkpointing into the txn? which makes no sense. Hmmm.

save/write/add_checkpoint_to_txn()? I'd say add is the best of this bunch.
txn_checkpoint_max_jobid()?

maybe there's a better one.

Yeah I wasn't too happy with the function names either. OK, I tacked on an earlier commit that renames checkpoint_to_kvs() to restart_save_state() and then made the new function restart_save_state_to_txn(). Hopefully that works a little better, and also normalizes all the public functions in restart.c to have a restart_* prefix, and avoids the use of "checkpoint" which recently has become a bit of a loaded term.

grondo

Nice! LGTM. Just a couple minor comments inline (one is more of a question)

grondo · 2022-07-28T15:37:40Z

src/modules/job-manager/job-manager.c

@@ -219,8 +219,8 @@ int mod_main (flux_t *h, int argc, char **argv)
        flux_log_error (h, "flux_reactor_run");
        goto done;
    }
-    if (checkpoint_to_kvs (&ctx) < 0) {
-        flux_log_error (h, "checkpoint_to_kvs");
+    if (restart_save_state (&ctx) < 0) {


Commit message typo: "confusign"

grondo · 2022-07-28T15:39:03Z

src/modules/job-manager/restart.c

@@ -197,6 +197,8 @@ static int restart_map_cb (struct job *job, void *arg, flux_error_t *error)
                   (uintmax_t)job->id);
        return -1;
    }
+    if (ctx->max_jobid < job->id)


Given the previous rename of "checkpoint" to "restart", should/could the commit message here be updated too?

garlick · 2022-07-28T16:14:34Z

Thanks! I fixed those things and forced a push. I'll set MWP.

Problem: the names of functions to save/restore state are confusing and use a prefix that doesn't match other public functions in the same source module. The term "checkpoint" is somewhat loaded now. Rename checkpoint_to_kvs() -> restart_save_state() checkpoint_restore() -> restart_restore_state() Update error messages to avoid the term "checkpoint" and eliminate a useless wrapper function.

Problem: the KVS checkpoint.job_manager key that contains the max job ID is only written out when the job manager is unloaded. If the unload fails, or if the broker crashes before then, an older value may be used and job IDs could be reissued. Track the max job id encountered during KVS job replay, and use the greater of the replayed job IDs or the saved max job ID value. Fixes flux-framework#4300

Problem: it is inconvenient to query the max_jobid from the job manager. Add max_jobid to the custom module stats object, so it can be queried with 'flux module stats' from test scripts.

Problem: the max job ID issued thus far is redundantly available from the job manager KVS checkpoint.job_manager key (written at module unload) and from scanning the KVS job history during restart, but purging can eliminate the latter option. Refactor function that synchronously saves state to expose a new function restart_save_state_to_txn(). Update checkpoint.job_manager whenever the job matching ctx->max_jobid is purged.

Problem: methods for ensuring the job manager starts from the newest-issued job ID have no test coverage. Add some tests to t22109-job-manager-restart.t.

codecov · 2022-07-28T17:05:30Z

Codecov Report

Merging #4443 (de22d63) into master (2086712) will increase coverage by 3.40%.
The diff coverage is 84.00%.

❗ Current head de22d63 differs from pull request most recent head f164876. Consider uploading reports for the commit f164876 to get more accurate results

@@            Coverage Diff             @@
##           master    #4443      +/-   ##
==========================================
+ Coverage   79.94%   83.34%   +3.40%     
==========================================
  Files         398      398              
  Lines       66699    67119     +420     
==========================================
+ Hits        53322    55941    +2619     
+ Misses      13377    11178    -2199

Impacted Files	Coverage Δ
src/modules/job-manager/job-manager.c	`62.26% <66.66%> (ø)`
src/modules/job-manager/purge.c	`83.33% <66.66%> (+6.00%)`	⬆️
src/modules/job-manager/restart.c	`77.00% <89.47%> (+6.28%)`	⬆️
src/bindings/python/flux/job/JobID.py	`85.71% <0.00%> (-14.29%)`	⬇️
src/bindings/python/flux/memoized_property.py	`88.88% <0.00%> (-11.12%)`	⬇️
src/bindings/python/flux/message.py	`76.04% <0.00%> (-8.85%)`	⬇️
src/bindings/python/flux/resource/ResourceSet.py	`88.99% <0.00%> (-8.01%)`	⬇️
src/bindings/python/flux/job/wait.py	`89.13% <0.00%> (-6.22%)`	⬇️
src/bindings/python/flux/job/Jobspec.py	`83.07% <0.00%> (-5.70%)`	⬇️
src/bindings/python/flux/resource/list.py	`88.57% <0.00%> (-5.37%)`	⬇️
... and 212 more

garlick changed the title ~~job-manager: improve robustness of max job id recovery on restat~~ job-manager: improve robustness of max job id recovery on restart Jul 28, 2022

chu11 reviewed Jul 28, 2022

View reviewed changes

garlick force-pushed the issue#4300 branch from a1d9cae to b2abc72 Compare July 28, 2022 15:11

grondo approved these changes Jul 28, 2022

View reviewed changes

garlick force-pushed the issue#4300 branch from b2abc72 to de22d63 Compare July 28, 2022 16:14

garlick added the merge-when-passing label Jul 28, 2022

garlick added 5 commits July 28, 2022 17:02

job-manager: add max_jobid to module stats

fac49c0

Problem: it is inconvenient to query the max_jobid from the job manager. Add max_jobid to the custom module stats object, so it can be queried with 'flux module stats' from test scripts.

testsuite: cover job manager restart jobid

f164876

Problem: methods for ensuring the job manager starts from the newest-issued job ID have no test coverage. Add some tests to t22109-job-manager-restart.t.

garlick force-pushed the issue#4300 branch from de22d63 to f164876 Compare July 28, 2022 17:02

mergify bot merged commit 246e37f into flux-framework:master Jul 28, 2022

garlick deleted the issue#4300 branch July 28, 2022 22:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

job-manager: improve robustness of max job id recovery on restart #4443

job-manager: improve robustness of max job id recovery on restart #4443

garlick commented Jul 27, 2022 •

edited

Loading

chu11 Jul 28, 2022 •

edited

Loading

garlick Jul 28, 2022 •

edited

Loading

grondo left a comment

grondo Jul 28, 2022

grondo Jul 28, 2022

garlick commented Jul 28, 2022

codecov bot commented Jul 28, 2022

job-manager: improve robustness of max job id recovery on restart #4443

job-manager: improve robustness of max job id recovery on restart #4443

Conversation

garlick commented Jul 27, 2022 • edited Loading

chu11 Jul 28, 2022 • edited Loading

Choose a reason for hiding this comment

garlick Jul 28, 2022 • edited Loading

Choose a reason for hiding this comment

grondo left a comment

Choose a reason for hiding this comment

grondo Jul 28, 2022

Choose a reason for hiding this comment

grondo Jul 28, 2022

Choose a reason for hiding this comment

garlick commented Jul 28, 2022

codecov bot commented Jul 28, 2022

Codecov Report

garlick commented Jul 27, 2022 •

edited

Loading

chu11 Jul 28, 2022 •

edited

Loading

garlick Jul 28, 2022 •

edited

Loading