
job-manager / job-exec: checkpoint and restore guest KVS namespaces #3947

Merged: 13 commits merged into flux-framework:master from issue3811_kvscheckpoint on Nov 20, 2021

Conversation

@chu11 (Member) commented Nov 9, 2021

Thought I'd throw up this WIP if there are any high-level thoughts / comments. This is a prototype set of changes for checkpointing KVS namespaces when a broker goes down and "re-attaching" to an already running job. What this PR does is:

A) upon unloading the job-exec module, checkpoint the rootref of the KVS namespaces of still-running jobs (see the sketch after this list)
B) upon re-load, if the job is believed to still be running, send job-exec a "re-attach" flag from job-manager
C) upon re-load, have job-exec re-build the KVS namespace with the checkpointed rootref
D) upon "re-attach", testexec will simulate the remaining runtime of the job based on the starttime of the job (e.g. if testexec sees the job has run 90 seconds out of 100, it simulates a run for just 10 more seconds)
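A minimal sketch of step (A), assuming synchronous KVS calls for brevity (the PR itself composes futures asynchronously); the function name and checkpoint key here are illustrative, not the PR's exact identifiers:

    #include <flux/core.h>

    /* Checkpoint the current root blobref of guest namespace 'ns' under
     * 'key' in the primary KVS namespace.  Synchronous for brevity. */
    static int checkpoint_ns_rootref (flux_t *h, const char *ns, const char *key)
    {
        flux_future_t *f = NULL;
        flux_kvs_txn_t *txn = NULL;
        const char *rootref;
        int rc = -1;

        /* look up the current root blobref of the guest namespace */
        if (!(f = flux_kvs_getroot (h, ns, 0))
            || flux_kvs_getroot_get_blobref (f, &rootref) < 0)
            goto done;
        /* write it to a checkpoint key in the primary namespace */
        if (!(txn = flux_kvs_txn_create ())
            || flux_kvs_txn_put (txn, 0, key, rootref) < 0)
            goto done;
        flux_future_destroy (f);
        if (!(f = flux_kvs_commit (h, NULL, 0, txn))
            || flux_future_get (f, NULL) < 0)
            goto done;
        rc = 0;
    done:
        flux_future_destroy (f);
        flux_kvs_txn_destroy (txn);
        return rc;
    }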

This currently works with tests like this:

flux mini submit --setattr=system.exec.test.run_duration=100s hostname

<wait a little>

# remove all job modules and KVS
flux module remove job-exec; flux module remove sched-simple; flux module remove job-list; flux module remove job-info; flux module remove job-manager; flux module remove job-ingest; flux module remove kvs

<wait a little>

# re-add KVS and all job modules
sleep 5; flux module load kvs; flux module load job-manager; flux module load job-info; flux module load job-list; flux module load job-ingest; flux module load job-exec; flux module load sched-simple

The eventlogs for the above test look like this (we would need RFC changes to document any new events we put in here). Some of the guest eventlog stuff is just for debugging / info.

>flux job eventlog f5xBPt79
1636493577.871553 submit userid=8556 urgency=16 flags=0
1636493577.885256 depend
1636493577.885322 priority priority=16
1636493577.887893 alloc annotations={"sched":{"resource_summary":"rank0/core0"}}
1636493577.890752 start
1636493590.477205 flux-reattach
1636493590.757046 reattach
1636493607.762088 finish status=0
1636493607.766412 release ranks="all" final=true
1636493607.767615 free
1636493607.767666 clean

>flux job eventlog -p guest.exec.eventlog f5xBPt79
1636493577.888897 init
1636493577.890580 starting
1636493577.890601 timer timerrun=30.0
1636493590.755381 reattach
1636493590.756492 re-starting
1636493590.756854 note timeleft=17
1636493590.756869 timer timerrun=17.0
1636493607.761802 timercb
1636493607.761841 complete status=0
1636493607.761891 done


Still need to clean up the code, try to make some code non-synchronous, and add tests, but I think this is a good first step.
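A sketch of the step (D) arithmetic behind the timeleft=17 event above, assuming the run duration and original start time are available as doubles (function and parameter names are illustrative):

    #include <flux/core.h>

    /* e.g. run_duration=100. and 90s have elapsed since starttime:
     * return 10. so testexec simulates only the remaining runtime. */
    static double remaining_runtime (flux_t *h, double run_duration,
                                     double starttime)
    {
        double elapsed = flux_reactor_now (flux_get_reactor (h)) - starttime;
        double timeleft = run_duration - elapsed;
        return timeleft > 0. ? timeleft : 0.;
    }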

@chu11 changed the title from "job-manager / job-exec: checkpoint and restore guest KVS namespaces" to "[WIP] job-manager / job-exec: checkpoint and restore guest KVS namespaces" on Nov 10, 2021
@garlick (Member) commented Nov 10, 2021

Haven't had a chance to review but, yay! Nice!

@grondo (Contributor) commented Nov 10, 2021

Yeah, me either, but nice work getting us moving forward on this @chu11!

It seems like you are very close to "restarting" from an existing sqlite content backing store. Any idea what prevents that currently (i.e. why not restart an instance entirely from the same content.backing-path rather than unload/reload modules)?

Excited to try this one out! (and double points for getting the testexec jobs to have the correct duration even after a reattach) 🙂

@chu11 (Member, Author) commented Nov 10, 2021

> It seems like you are very close to "restarting" from an existing sqlite content backing store. Any idea what prevents that currently (i.e. why not restart an instance entirely from the same content.backing-path rather than unload/reload modules)?

It was mostly the fact that jobs are killed in rc1/rc3. I think it's doable, but I just haven't tried yet :-)

@chu11 force-pushed the issue3811_kvscheckpoint branch from 2bd4ebb to 17a0396 on November 11, 2021 22:09
@chu11 (Member, Author) commented Nov 11, 2021

Just re-pushed with various cleanup so the code isn't quite as big a mess. Also verified that using a preset content backing store works too, i.e. this works:

rm -f /tmp/achu/content.sqlite ; src/cmd/flux start -o,--setattr=content.backing-path=/tmp/achu//content.sqlite

flux mini submit --setattr=system.exec.test.run_duration=60s hostname

src/cmd/flux start -o,--setattr=content.backing-path=/tmp/achu//content.sqlite

And after this the job is still listed in flux jobs -A. Note that we have to modify etc/rc1 to remove the cancellation of jobs when exiting an instance. For the tests I write, I'll add an environment variable or something like that if I need to tell rc1 not to do that.

Edit: added some tests too

@chu11 force-pushed the issue3811_kvscheckpoint branch from 569e1de to c079a0d on November 12, 2021 22:51
@chu11 (Member, Author) commented Nov 12, 2021

Re-pushed, just adding some tests and some instrumentation to try to get more test coverage.

@chu11 force-pushed the issue3811_kvscheckpoint branch from c079a0d to 567fc3d on November 15, 2021 18:07
@chu11 changed the title from "[WIP] job-manager / job-exec: checkpoint and restore guest KVS namespaces" to "job-manager / job-exec: checkpoint and restore guest KVS namespaces" on Nov 15, 2021
@chu11 (Member, Author) commented Nov 15, 2021

Removing WIP as I think this has reached a point of being reasonably reviewable.

@garlick (Member) commented Nov 16, 2021

Just to double check before digging in - if the namespace already exists, we don't restore a potentially older rootref on top of it do we?

@chu11 force-pushed the issue3811_kvscheckpoint branch 2 times, most recently from 0cb4678 to 9249f51 on November 16, 2021 23:04
@chu11 (Member, Author) commented Nov 16, 2021

> Just to double check before digging in - if the namespace already exists, we don't restore a potentially older rootref on top of it do we?

No, if the namespace already exists you'll just get a normal EEXIST error upon namespace creation; I check for it and fall through to the rest of the normal setup code:

            if (!job->reattach || errno != EEXIST) {
                jobinfo_fatal_error (job, errno, "failed to create guest ns");
                goto done;
            }

Although that is a good point that I should write a test for this.
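A hedged sketch of the surrounding fallthrough logic, with illustrative variable names (nsname, the job fields) and synchronous future handling for brevity; the PR seeds the namespace from the checkpointed rootref, a detail elided here:

    /* On reattach, re-create the guest namespace; tolerate EEXIST
     * if the namespace survived the module reload. */
    flux_future_t *f;
    if (!(f = flux_kvs_namespace_create (h, nsname, job->userid, 0))
        || flux_future_get (f, NULL) < 0) {
        if (!job->reattach || errno != EEXIST) {
            jobinfo_fatal_error (job, errno, "failed to create guest ns");
            goto done;
        }
        /* namespace already exists: fall through to normal setup */
    }
    flux_future_destroy (f);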

@chu11 (Member, Author) commented Nov 16, 2021

Just re-pushed, doing some cleanup and prefixing some of the eventlog events with "debug." and only outputting them when job debugging is set up.

@chu11 force-pushed the issue3811_kvscheckpoint branch from 9249f51 to 1af7735 on November 17, 2021 04:16
@chu11 (Member, Author) commented Nov 17, 2021

Re-pushed with some minor cleanups, but most importantly added a test for when the guest KVS namespace is not destroyed during a "reattach".

@garlick (Member) left a review:

I added a few comments, maybe not super helpful at this point. I'll make another pass in the morning when I have more than half a brain :-)

@@ -1094,10 +1094,170 @@ static int configure_implementations (flux_t *h, int argc, char **argv)
return 0;
}

static int unload_implementations (void)
static flux_future_t *lookup_namespace_roots (struct job_exec_ctx *ctx)
garlick (Member):

Is there a way this code could be kept separate from job-exec.c with interfaces that would let it be reused more easily? Maybe it could be broken out to a separate file and take a zlistx_t of jobids from zhashx_keys() rather than the job_exec_ctx?

chu11 (Member, Author):

I don't think I can abstract it at that level of detail because the job->running flag has to be looked at to determine which jobs are running. But it can definitely be abstracted more, perhaps by just passing the jobs hash around.


if (flux_kvs_txn_put (txn,
0,
"job.checkpoint.running.namespaces",
garlick (Member):

I'd say this "belongs" to job-exec so put it under a job-exec. prefix? Probably not any need for the long path either, maybe job-exec.kvs-namespaces?

if (job->running) {
if (!fall) {
if (!(fall = flux_future_wait_all_create ())) {
flux_log_error (ctx->h, "flux_future_wait_all_create");
garlick (Member):

Error messages could be improved in this function.

garlick (Member):

I meant to say "this commit" rather than "this function". The error messages all look like they do not contain enough context to really run anything down.

json_t *nsdata = NULL;

if (!(nsdata = json_array ())) {
flux_log (ctx->h, LOG_ERR, "json_array");
garlick (Member):

or maybe "out of memory"?

if ((job->flags & FLUX_JOB_DEBUG)) {
if (event_job_post_pack (ctx->event,
job,
"debug.flux-reattach",
garlick (Member):

why is "flux" in the name?

At the moment I'm not finding the event that signifies completion but if it's still there, prob should balance the names, like debug.exec-reattach-start and debug.exec-reattach-finish. (those are the event suffixes used by prolog/epilog events)

chu11 (Member, Author):

Good point, I was just following the flux-restart event from earlier. start and finish seem better.

@chu11 force-pushed the issue3811_kvscheckpoint branch 2 times, most recently from 5385406 to 8dbabef on November 18, 2021 20:56
@chu11 (Member, Author) commented Nov 18, 2021

Re-pushed with fixes based on the comments above. I had to squash everything since the fixes didn't quite flow right as just fixups.

  • spliced out the checkpoint code into new files job-exec/checkpoint.[ch] and added better error messages
  • the debug.exec-reattach-start/finish events are added in a new commit, since they were previously across two different commits
  • new checkpoint path in the KVS

@chu11 force-pushed the issue3811_kvscheckpoint branch from 8dbabef to d0a26cd on November 18, 2021 22:08
@garlick (Member) left a review:

Oops, started reviewing again but you are still updating. I'll just make these two comments and ask that you let me know when you're ready.

Comment on lines 11 to 16
/* Prototype flux job exec service
*
* DESCRIPTION
*
* This module implements the exec interface as described in
* job-manager/start.c, but does not currently support execution of
garlick (Member):

Header cut & pasted from job-exec.c.

A paragraph or two describing what this does (and its prototype nature) would be good actually! (I got momentarily excited :-)

Comment on lines 108 to 36
if (!(fall = flux_future_wait_all_create ())) {
flux_log_error (h, "lookup_nsroots: "
"flux_future_wait_all_create");
goto cleanup;
}
garlick (Member):

Looks like all the error cases in this function are equally unlikely and uninteresting to distinguish. Suggest you simply log one message down in checkpoint_running() where the function is called, like "failed to initiate KVS requests for namespace checkpoint".

garlick (Member):

Was going to make the same comment for the function after that one. Maybe the whole file needs its error logs audited.

@chu11 force-pushed the issue3811_kvscheckpoint branch from d0a26cd to 7743da7 on November 18, 2021 23:23
@chu11 (Member, Author) commented Nov 18, 2021

> Oops, started reviewing again but you are still updating. I'll just make these two comments and ask that you let me know when you're ready.

Oops, sorry! Shortly after I pushed I realized I didn't remove that cut & pasted header stuff. So I tried to sneak in their removal.

Updated the headers in checkpoint.[ch], and consolidated the log messages per your comments above.

Edit: crud, build is breaking, lemme fix
Edit2: ok think i fixed it

@garlick (Member) left a review:

Sorry to give you dribs and drabs. Apparently Sue ordered dinner and it's here so I'll be offline for an hour or so. Couple more comments for ya.

int rv = -1;

if (!nsdata || json_array_size (nsdata) <= 0)
return 0;
garlick (Member):

set errno

chu11 (Member, Author):

Actually, I can just remove this check now; the logic is different from some initial work.

garlick (Member):

Oh duh, that's not an error is it.

Comment on lines 168 to 231
if (flux_future_wait_for (fall, -1.) < 0) {
flux_log_error (h, "kvs_checkpoint: flux_future_wait_for");
goto cleanup;
}
garlick (Member):

Is this required? If so, maybe combine error log with get_nsroots() since a failure here is also a failure to get ns root refs.

@@ -40,6 +40,7 @@ struct job {
uint8_t free_pending:1; // free request sent to sched
garlick (Member):

In the description, could you indicate that a boolean flag was added to the start request payload? It's kind of vague with the current wording. Also, it's for any job that's in the RUN state per its replayed eventlog, not a "belief" per se (you're giving the job manager too much credit!)

@chu11 force-pushed the issue3811_kvscheckpoint branch from 95945d5 to a6de4ab on November 19, 2021 01:03
@chu11 (Member, Author) commented Nov 19, 2021

Re-pushed with some minor fixes per the discussion above and offline.

  • checkpointing now also removes namespaces that were checkpointed.
  • minor tweaks to some checkpoint logic
  • don't output an error when there is nothing to checkpoint (a logic error I had earlier)
  • update some descriptions

@garlick (Member) left a review:

This seems pretty close. Just one comment about location of source.

Comment on lines +866 to +869
static void get_rootref_cb (flux_future_t *fprev, void *arg)
{
int saved_errno;
garlick (Member):

Should this code for restoring namespaces be in the same source file as the code to save them?

chu11 (Member, Author):

I assume you meant "removing namespaces". Good point, perhaps it should be in job-exec.c because that's where we create/set up namespaces too. I'll move it in there.

@chu11 force-pushed the issue3811_kvscheckpoint branch from a6de4ab to c27a8ae on November 19, 2021 20:25
@chu11 (Member, Author) commented Nov 19, 2021

Re-pushed; the removal of recently checkpointed namespaces is now within job-exec.c. I spliced it out into its own commit within the series.

@garlick (Member) commented Nov 19, 2021

Well, my thought was that if checkpoint_running() in checkpoint.c writes the checkpoint object, that ns_get_rootref() which reads the checkpoint object should also be in checkpoint.c. That way the details of the checkpoint object wouldn't have to be exposed outside of that source file. Would that potentially work?

@chu11 (Member, Author) commented Nov 19, 2021

> Well, my thought was that if checkpoint_running() in checkpoint.c writes the checkpoint object, that ns_get_rootref() which reads the checkpoint object should also be in checkpoint.c. That way the details of the checkpoint object wouldn't have to be exposed outside of that source file. Would that potentially work?

Ohh sorry, I didn't know you were referring to that half of this PR. Yes, I think that would work and is a good idea too. Let me do that.

I do think that the "remove namespaces" part can stay in job-exec though, since it's independent of the checkpointing.

@garlick (Member) commented Nov 19, 2021

No my bad - I was not very clear about what code I was referring to!

@chu11 force-pushed the issue3811_kvscheckpoint branch from c27a8ae to e4b2967 on November 20, 2021 01:03
@chu11 (Member, Author) commented Nov 20, 2021

Re-pushed; I added

flux_future_t *checkpoint_get_rootrefs (flux_t *h);

char *checkpoint_find_rootref (flux_future_t *f,
                               flux_jobid_t id,
                               uint32_t owner);

as helper functions to get the checkpoint object and find a rootref within it. Because of all of the flux future compositions/continuations, I didn't want to move that code into the checkpoint lib.
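A hedged usage sketch of these helpers; the synchronous use of the returned future, the free() ownership of the returned string, and the wrapper's name are assumptions, not taken from the PR:

    #include <stdlib.h>
    #include <flux/core.h>
    #include "checkpoint.h"  /* checkpoint_get_rootrefs(), checkpoint_find_rootref() */

    /* Look up the checkpointed guest namespace rootref for job 'id'
     * owned by 'owner'; returns NULL if no checkpoint entry exists. */
    static char *lookup_job_rootref (flux_t *h, flux_jobid_t id, uint32_t owner)
    {
        flux_future_t *f;
        char *rootref;

        if (!(f = checkpoint_get_rootrefs (h)))
            return NULL;
        /* search the checkpoint object for this job's namespace rootref */
        rootref = checkpoint_find_rootref (f, id, owner);
        flux_future_destroy (f);
        return rootref;
    }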

@garlick (Member) commented Nov 20, 2021

Thanks @chu11 - this looks good. Let's get this merged so we can keep the momentum up!

Nice work!

chu11 added 13 commits November 19, 2021 21:46
Problem: The jobinfo_started() function took a fmt parameter and
variable args, but never used them.

Solution: Remove the variable argument options to jobinfo_started().
Adjust callers accordingly.

Problem: If the job-exec module is unloaded with running jobs, we have
no way to recreate the KVS namespaces for those jobs if we wish to
re-attach to them later.

Solution: Checkpoint the KVS root references for any KVS namespaces
of running jobs.

Problem: After checkpointing running namespaces, we do not want
running jobs to continue to write to those guest namespaces.

Solution: On job-exec unload, remove the namespaces we just
checkpointed.

Problem: When the job-manager is re-loaded and discovers a job that is
in the RUN state, there is no way to inform the job-exec module
that the job should still be running.

Solution: Add a "reattach" flag to the job-exec.start RPC.  This
flag informs the job-exec module that the job should still be
running.

Problem: The testexec exec implementation does not parse and handle
the "reattach" flag from the job-manager.

Solution: If the testexec implementation sees the "reattach" flag from
the job-manager, emulate that the job is still running by running it
for the remaining time it should run, given the job's start time.  Emit
a "re-starting" event indicating this restart and notify the
job-manager via a "reattach" event.

Problem: When re-attaching to an already running job, the guest KVS
namespace no longer exists.

Solution: Re-create the KVS namespace based on the previously
checkpointed root reference for the job.

Problem: Job "reattach" is difficult to test at the moment.

Solution: Add reattach start and finish events to aid in testing.

Problem: rc1 cancels all jobs when an instance exits, but that may
not be desirable all of the time, such as in some testing scenarios.

Solution: Support an environment variable FLUX_INSTANCE_RESTART to notify
rc1 that an instance restart is occurring and that it should not cancel
jobs upon instance shutdown.

Problem: Under test scenarios, it may be difficult to reattach
to a job that "already ended".

Solution: Support a flag that will assume a reattached job has
already finished and will go through the normal process for an
already completed job.

Add initial tests to verify that jobs can survive instance restarts
using the job-exec testexec execution plugin.
@codecov (bot) commented Nov 20, 2021

Codecov Report

Merging #3947 (61df254) into master (90cc9c8) will decrease coverage by 0.04%.
The diff coverage is 76.70%.

@@            Coverage Diff             @@
##           master    #3947      +/-   ##
==========================================
- Coverage   83.54%   83.49%   -0.05%     
==========================================
  Files         358      359       +1     
  Lines       52764    52987     +223     
==========================================
+ Hits        44080    44240     +160     
- Misses       8684     8747      +63     
Impacted Files                        Coverage Δ
src/modules/job-manager/restart.c     83.45% <71.42%> (-0.01%) ⬇️
src/modules/job-exec/checkpoint.c     72.38% <72.38%> (ø)
src/modules/job-exec/testexec.c       85.07% <74.54%> (-2.83%) ⬇️
src/modules/job-exec/job-exec.c       76.67% <83.56%> (+0.50%) ⬆️
src/modules/job-manager/start.c       73.33% <87.50%> (+0.52%) ⬆️
src/modules/job-exec/exec.c           79.28% <100.00%> (ø)
src/broker/overlay.c                  87.44% <0.00%> (-1.25%) ⬇️
src/broker/broker.c                   76.58% <0.00%> (-0.41%) ⬇️
src/cmd/flux-module.c                 80.83% <0.00%> (+0.34%) ⬆️
... and 3 more

@mergify (bot) merged commit 562d29b into flux-framework:master on Nov 20, 2021
@chu11 deleted the issue3811_kvscheckpoint branch on December 17, 2021 18:13