Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job-manager: improve robustness of max job id recovery on restart #4443

Merged
merged 5 commits into from
Jul 28, 2022

Conversation

garlick
Copy link
Member

@garlick garlick commented Jul 27, 2022

Problem: as noted in #4300, the job-manager only writes out the max job ID to the KVS when it is unloading, so despite best efforts to preserve KVS content (e.g. with defensive checkpoints added in #4383) when there is a crash or a failed shutdown, we are in danger of starting from an old max ID value and issuing duplicate job IDs.

This just uses the greater of the checkpoint value or the maximum job ID replayed from the KVS to improve robustness.

Also, since purging can eliminate all jobs in the KVS, we add code to update the checkpoint value in the KVS if the job matching the current max job ID happens to be purged, as suggested by @grondo.

@garlick garlick changed the title job-manager: improve robustness of max job id recovery on restat job-manager: improve robustness of max job id recovery on restart Jul 28, 2022
@@ -20,6 +20,8 @@ int restart_from_kvs (struct job_manager *ctx);
/* exposed for unit testing only */
int restart_count_char (const char *s, char c);

int checkpoint_to_txn (struct job_manager *ctx, flux_kvs_txn_t *txn);
Copy link
Member

@chu11 chu11 Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it's just me, but this sounds like you're checkpointing into the txn? which makes no sense. Hmmm.

save/write/add_checkpoint_to_txn()? I'd say add is the best of this bunch.
txn_checkpoint_max_jobid()?

maybe there's a better one.

Copy link
Member Author

@garlick garlick Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I wasn't too happy with the function names either. OK, I tacked on an earlier commit that renames checkpoint_to_kvs() to restart_save_state() and then made the new function restart_save_state_to_txn(). Hopefully that works a little better, and also normalizes all the public functions in restart.c to have a restart_* prefix, and avoids the use of "checkpoint" which recently has become a bit of a loaded term.

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! LGTM. Just a couple minor comments inline (one is more of a question)

@@ -219,8 +219,8 @@ int mod_main (flux_t *h, int argc, char **argv)
flux_log_error (h, "flux_reactor_run");
goto done;
}
if (checkpoint_to_kvs (&ctx) < 0) {
flux_log_error (h, "checkpoint_to_kvs");
if (restart_save_state (&ctx) < 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit message typo: "confusign"

@@ -197,6 +197,8 @@ static int restart_map_cb (struct job *job, void *arg, flux_error_t *error)
(uintmax_t)job->id);
return -1;
}
if (ctx->max_jobid < job->id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the previous rename of "checkpoint" to "restart", should/could the commit message here be updated too?

@garlick
Copy link
Member Author

garlick commented Jul 28, 2022

Thanks! I fixed those things and forced a push. I'll set MWP.

garlick added 5 commits July 28, 2022 17:02
Problem: the names of functions to save/restore state are
confusing and use a prefix that doesn't match other public
functions in the same source module.

The term "checkpoint" is somewhat loaded now.

Rename
 checkpoint_to_kvs() -> restart_save_state()
 checkpoint_restore() -> restart_restore_state()

Update error messages to avoid the term "checkpoint" and
eliminate a useless wrapper function.
Problem: the KVS checkpoint.job_manager key that contains the max
job ID is only written out when the job manager is unloaded.  If the
unload fails, or if the broker crashes before then, an older value
may be used and job IDs could be reissued.

Track the max job id encountered during KVS job replay, and use the
greater of the replayed job IDs or the saved max job ID value.

Fixes flux-framework#4300
Problem: it is inconvenient to query the max_jobid from the
job manager.

Add max_jobid to the custom module stats object, so it can be
queried with 'flux module stats' from test scripts.
Problem: the max job ID issued thus far is redundantly available
from the job manager KVS checkpoint.job_manager key (written at
module unload) and from scanning the KVS job history during restart,
but purging can eliminate the latter option.

Refactor function that synchronously saves state to expose a new
function restart_save_state_to_txn().

Update checkpoint.job_manager whenever the job matching ctx->max_jobid
is purged.
Problem: methods for ensuring the job manager starts from
the newest-issued job ID have no test coverage.

Add some tests to t22109-job-manager-restart.t.
@codecov
Copy link

codecov bot commented Jul 28, 2022

Codecov Report

Merging #4443 (de22d63) into master (2086712) will increase coverage by 3.40%.
The diff coverage is 84.00%.

❗ Current head de22d63 differs from pull request most recent head f164876. Consider uploading reports for the commit f164876 to get more accurate results

@@            Coverage Diff             @@
##           master    #4443      +/-   ##
==========================================
+ Coverage   79.94%   83.34%   +3.40%     
==========================================
  Files         398      398              
  Lines       66699    67119     +420     
==========================================
+ Hits        53322    55941    +2619     
+ Misses      13377    11178    -2199     
Impacted Files Coverage Δ
src/modules/job-manager/job-manager.c 62.26% <66.66%> (ø)
src/modules/job-manager/purge.c 83.33% <66.66%> (+6.00%) ⬆️
src/modules/job-manager/restart.c 77.00% <89.47%> (+6.28%) ⬆️
src/bindings/python/flux/job/JobID.py 85.71% <0.00%> (-14.29%) ⬇️
src/bindings/python/flux/memoized_property.py 88.88% <0.00%> (-11.12%) ⬇️
src/bindings/python/flux/message.py 76.04% <0.00%> (-8.85%) ⬇️
src/bindings/python/flux/resource/ResourceSet.py 88.99% <0.00%> (-8.01%) ⬇️
src/bindings/python/flux/job/wait.py 89.13% <0.00%> (-6.22%) ⬇️
src/bindings/python/flux/job/Jobspec.py 83.07% <0.00%> (-5.70%) ⬇️
src/bindings/python/flux/resource/list.py 88.57% <0.00%> (-5.37%) ⬇️
... and 212 more

@mergify mergify bot merged commit 246e37f into flux-framework:master Jul 28, 2022
@garlick garlick deleted the issue#4300 branch July 28, 2022 22:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants