
job-manager / job-exec: checkpoint and restore guest KVS namespaces #3947

Merged: 13 commits merged into flux-framework:master from issue3811_kvscheckpoint on Nov 20, 2021

Conversation

@chu11 (Member) commented Nov 9, 2021

Thought I'd throw up this WIP if there are any high-level thoughts / comments. This is a prototype set of changes for checkpointing KVS namespaces when a broker goes down and "re-attaching" to an already running job. What this PR does is:

A) upon unloading the job-exec module, checkpoint the rootref of the KVS namespaces of still-running jobs (see the sketch after this list)
B) upon re-load, if the job is believed to still be running, send job-exec a "re-attach" flag from job-manager
C) upon re-load, have job-exec re-build the KVS namespace with the checkpointed rootref
D) upon "re-attach", testexec will simulate the remaining runtime of the job based on the starttime of the job (e.g. if testexec sees the job has run 90 seconds out of 100, it simulates a run for just 10 more seconds)
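A minimal sketch of step (A), assuming synchronous KVS calls for brevity (the PR itself composes futures asynchronously); the function name and checkpoint key here are illustrative, not the PR's exact identifiers:

    #include <flux/core.h>

    /* Checkpoint the current root blobref of guest namespace 'ns' under
     * 'key' in the primary KVS namespace.  Synchronous for brevity. */
    static int checkpoint_ns_rootref (flux_t *h, const char *ns, const char *key)
    {
        flux_future_t *f = NULL;
        flux_kvs_txn_t *txn = NULL;
        const char *rootref;
        int rc = -1;

        /* look up the current root blobref of the guest namespace */
        if (!(f = flux_kvs_getroot (h, ns, 0))
            || flux_kvs_getroot_get_blobref (f, &rootref) < 0)
            goto done;
        /* write it to a checkpoint key in the primary namespace */
        if (!(txn = flux_kvs_txn_create ())
            || flux_kvs_txn_put (txn, 0, key, rootref) < 0)
            goto done;
        flux_future_destroy (f);
        if (!(f = flux_kvs_commit (h, NULL, 0, txn))
            || flux_future_get (f, NULL) < 0)
            goto done;
        rc = 0;
    done:
        flux_future_destroy (f);
        flux_kvs_txn_destroy (txn);
        return rc;
    }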

This currently works with tests like this:

flux mini submit --setattr=system.exec.test.run_duration=100s hostname

<wait a little>

# remove all job modules and KVS
flux module remove job-exec; flux module remove sched-simple; flux module remove job-list; flux module remove job-info; flux module remove job-manager; flux module remove job-ingest; flux module remove kvs

<wait a little>

# re-add KVS and all job modules
sleep 5; flux module load kvs; flux module load job-manager; flux module load job-info; flux module load job-list; flux module load job-ingest; flux module load job-exec; flux module load sched-simple

The eventlogs for the above test look like this (we would need RFC changes to document any new events we put in here). Some of the guest eventlog stuff is just for debugging / info.

>flux job eventlog f5xBPt79
1636493577.871553 submit userid=8556 urgency=16 flags=0
1636493577.885256 depend
1636493577.885322 priority priority=16
1636493577.887893 alloc annotations={"sched":{"resource_summary":"rank0/core0"}}
1636493577.890752 start
1636493590.477205 flux-reattach
1636493590.757046 reattach
1636493607.762088 finish status=0
1636493607.766412 release ranks="all" final=true
1636493607.767615 free
1636493607.767666 clean

>flux job eventlog -p guest.exec.eventlog f5xBPt79
1636493577.888897 init
1636493577.890580 starting
1636493577.890601 timer timerrun=30.0
1636493590.755381 reattach
1636493590.756492 re-starting
1636493590.756854 note timeleft=17
1636493590.756869 timer timerrun=17.0
1636493607.761802 timercb
1636493607.761841 complete status=0
1636493607.761891 done


Still need to clean up the code, try to make some code non-synchronous, and add tests, but I think this is a good first step.
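A sketch of the step (D) arithmetic behind the timeleft=17 event above, assuming the run duration and original start time are available as doubles (function and parameter names are illustrative):

    #include <flux/core.h>

    /* e.g. run_duration=100. and 90s have elapsed since starttime:
     * return 10. so testexec simulates only the remaining runtime. */
    static double remaining_runtime (flux_t *h, double run_duration,
                                     double starttime)
    {
        double elapsed = flux_reactor_now (flux_get_reactor (h)) - starttime;
        double timeleft = run_duration - elapsed;
        return timeleft > 0. ? timeleft : 0.;
    }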

@chu11 changed the title from "job-manager / job-exec: checkpoint and restore guest KVS namespaces" to "[WIP] job-manager / job-exec: checkpoint and restore guest KVS namespaces" on Nov 10, 2021
@garlick (Member) commented Nov 10, 2021

Haven't had a chance to review but, yay! Nice!

@grondo (Contributor) commented Nov 10, 2021

Yeah, me either, but nice work getting us moving forward on this @chu11!

It seems like you are very close to "restarting" from an existing sqlite content backing store. Any idea what prevents that currently (i.e. why not restart an instance entirely from the same content.backing-path rather than unload/reload modules)?

Excited to try this one out! (and double points for getting the testexec jobs to have the correct duration even after a reattach) 🙂

@chu11 (Member, Author) commented Nov 10, 2021

> It seems like you are very close to "restarting" from an existing sqlite content backing store. Any idea what prevents that currently (i.e. why not restart an instance entirely from the same content.backing-path rather than unload/reload modules)?

It was mostly the fact that jobs are killed in rc1/rc3. I think it's doable, but I just haven't tried yet :-)

@chu11 force-pushed the issue3811_kvscheckpoint branch from 2bd4ebb to 17a0396 on November 11, 2021 22:09
@chu11 (Member, Author) commented Nov 11, 2021

Just re-pushed with various cleanup so the code isn't quite as big a mess. Also verified that using a preset content backing store works too, i.e. this works:

rm -f /tmp/achu/content.sqlite ; src/cmd/flux start -o,--setattr=content.backing-path=/tmp/achu//content.sqlite

flux mini submit --setattr=system.exec.test.run_duration=60s hostname

src/cmd/flux start -o,--setattr=content.backing-path=/tmp/achu//content.sqlite

And after this the job is still listed in flux jobs -A. Note that we have to modify etc/rc1 to remove the cancellation of jobs when exiting an instance. For the tests I write, I'll add an environment variable or something like that if I need to tell rc1 not to do that.

Edit: added some tests too

@chu11 force-pushed the issue3811_kvscheckpoint branch from 569e1de to c079a0d on November 12, 2021 22:51
@chu11 (Member, Author) commented Nov 12, 2021

Re-pushed, just adding some tests and some instrumentation to try to get more test coverage.

@chu11 force-pushed the issue3811_kvscheckpoint branch from c079a0d to 567fc3d on November 15, 2021 18:07
@chu11 changed the title from "[WIP] job-manager / job-exec: checkpoint and restore guest KVS namespaces" to "job-manager / job-exec: checkpoint and restore guest KVS namespaces" on Nov 15, 2021
@chu11 (Member, Author) commented Nov 15, 2021

Removing WIP as I think this has reached a point of being reasonably reviewable.

@garlick (Member) commented Nov 16, 2021

Just to double check before digging in - if the namespace already exists, we don't restore a potentially older rootref on top of it do we?

@chu11 force-pushed the issue3811_kvscheckpoint branch 2 times, most recently from 0cb4678 to 9249f51 on November 16, 2021 23:04
@chu11 (Member, Author) commented Nov 16, 2021

> Just to double check before digging in - if the namespace already exists, we don't restore a potentially older rootref on top of it do we?

No, if the namespace already exists you'll just get a normal EEXIST error upon namespace creation; I check for it and fall through to the rest of the normal setup code:

            if (!job->reattach || errno != EEXIST) {
                jobinfo_fatal_error (job, errno, "failed to create guest ns");
                goto done;
            }

Although that is a good point that I should write a test for this.
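A hedged sketch of the surrounding fallthrough logic, with illustrative variable names (nsname, the job fields) and synchronous future handling for brevity; the PR seeds the namespace from the checkpointed rootref, a detail elided here:

    /* On reattach, re-create the guest namespace; tolerate EEXIST
     * if the namespace survived the module reload. */
    flux_future_t *f;
    if (!(f = flux_kvs_namespace_create (h, nsname, job->userid, 0))
        || flux_future_get (f, NULL) < 0) {
        if (!job->reattach || errno != EEXIST) {
            jobinfo_fatal_error (job, errno, "failed to create guest ns");
            goto done;
        }
        /* namespace already exists: fall through to normal setup */
    }
    flux_future_destroy (f);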

@chu11 (Member, Author) commented Nov 16, 2021

Just re-pushed, doing some cleanup and prefixing some of the eventlog events with "debug." and only outputting them when job debugging is set up.

@chu11 force-pushed the issue3811_kvscheckpoint branch from 9249f51 to 1af7735 on November 17, 2021 04:16
@chu11 (Member, Author) commented Nov 17, 2021

Re-pushed with some minor cleanups, but most importantly added a test for when the guest KVS namespace is not destroyed during a "reattach".

@garlick (Member) left a review:

I added a few comments, maybe not super helpful at this point. I'll make another pass in the morning when I have more than half a brain :-)

@@ -1094,10 +1094,170 @@ static int configure_implementations (flux_t *h, int argc, char **argv)
return 0;
}

static int unload_implementations (void)
static flux_future_t *lookup_namespace_roots (struct job_exec_ctx *ctx)
garlick (Member):

Is there a way this code could be kept separate from job-exec.c with interfaces that would let it be reused more easily? Maybe it could be broken out to a separate file and take a zlistx_t of jobids from zhashx_keys() rather than the job_exec_ctx?

chu11 (Member, Author):

I don't think I can abstract it at that level of detail because the job->running flag has to be looked at to determine which jobs are running. But it can definitely be abstracted more, perhaps by just passing the jobs hash around.


if (flux_kvs_txn_put (txn,
0,
"job.checkpoint.running.namespaces",
garlick (Member):

I'd say this "belongs" to job-exec so put it under a job-exec. prefix? Probably not any need for the long path either, maybe job-exec.kvs-namespaces?

if (job->running) {
if (!fall) {
if (!(fall = flux_future_wait_all_create ())) {
flux_log_error (ctx->h, "flux_future_wait_all_create");
garlick (Member):

Error messages could be improved in this function.

garlick (Member):

I meant to say "this commit" rather than "this function". The error messages all look like they do not contain enough context to really run anything down.

json_t *nsdata = NULL;

if (!(nsdata = json_array ())) {
flux_log (ctx->h, LOG_ERR, "json_array");
garlick (Member):

or maybe "out of memory"?

if ((job->flags & FLUX_JOB_DEBUG)) {
if (event_job_post_pack (ctx->event,
job,
"debug.flux-reattach",
garlick (Member):

why is "flux" in the name?

At the moment I'm not finding the event that signifies completion but if it's still there, prob should balance the names, like debug.exec-reattach-start and debug.exec-reattach-finish. (those are the event suffixes used by prolog/epilog events)

chu11 (Member, Author):

Good point, I was just following the flux-restart event from earlier. start and finish seem better.

@chu11 force-pushed the issue3811_kvscheckpoint branch 2 times, most recently from 5385406 to 8dbabef on November 18, 2021 20:56
@chu11 (Member, Author) commented Nov 18, 2021

Re-pushed with fixes based on the comments above. I had to squash everything since the fixes didn't quite flow right as just fixups.

  • spliced out the checkpoint code into new files job-exec/checkpoint.[ch] and added better error messages
  • the debug.exec-reattach-start/finish events are added in a new commit, since they were previously across two different commits
  • new checkpoint path in the KVS

@chu11 force-pushed the issue3811_kvscheckpoint branch from 8dbabef to d0a26cd on November 18, 2021 22:08
@garlick (Member) left a review:

Oops, started reviewing again but you are still updating. I'll just make these two comments and ask that you let me know when you're ready.

Comment on lines 11 to 16
/* Prototype flux job exec service
*
* DESCRIPTION
*
* This module implements the exec interface as described in
* job-manager/start.c, but does not currently support execution of
garlick (Member):

Header cut & pasted from job-exec.c.

A paragraph or two describing what this does (and its prototype nature) would be good actually! (I got momentarily excited :-)

Comment on lines 108 to 36
if (!(fall = flux_future_wait_all_create ())) {
flux_log_error (h, "lookup_nsroots: "
"flux_future_wait_all_create");
goto cleanup;
}
garlick (Member):

Looks like all the error cases in this function are equally unlikely and uninteresting to distinguish. Suggest you simply log one message down in checkpoint_running() where the function is called, like "failed to initiate KVS requests for namespace checkpoint".

garlick (Member):

Was going to make the same comment for the function after that one. Maybe the whole file needs its error logs audited.

@chu11 force-pushed the issue3811_kvscheckpoint branch from d0a26cd to 7743da7 on November 18, 2021 23:23
@chu11 (Member, Author) commented Nov 18, 2021

> Oops, started reviewing again but you are still updating. I'll just make these two comments and ask that you let me know when you're ready.

Oops, sorry! Shortly after I pushed I realized I didn't remove that cut & pasted header stuff. So I tried to sneak in their removal.

Updated the headers in checkpoint.[ch], and consolidated the log messages per your comments above.

Edit: crud, build is breaking, lemme fix
Edit2: ok think i fixed it

@garlick (Member) left a review:

Sorry to give you dribs and drabs. Apparently Sue ordered dinner and it's here so I'll be offline for an hour or so. Couple more comments for ya.

int rv = -1;

if (!nsdata || json_array_size (nsdata) <= 0)
return 0;
garlick (Member):

set errno

chu11 (Member, Author):

Actually, I can just remove this check now; the logic is different from some initial work.

garlick (Member):

Oh duh, that's not an error is it.

Comment on lines 168 to 231
if (flux_future_wait_for (fall, -1.) < 0) {
flux_log_error (h, "kvs_checkpoint: flux_future_wait_for");
goto cleanup;
}
garlick (Member):

Is this required? If so, maybe combine error log with get_nsroots() since a failure here is also a failure to get ns root refs.

@@ -40,6 +40,7 @@ struct job {
uint8_t free_pending:1; // free request sent to sched
garlick (Member):

In the description, could you indicate that a boolean flag was added to the start request payload? It's kind of vague with the current wording. Also, it's for any job that's in the RUN state per its replayed eventlog, not a "belief" per se (you're giving the job manager too much credit!)

@chu11 force-pushed the issue3811_kvscheckpoint branch from 95945d5 to a6de4ab on November 19, 2021 01:03
@chu11 (Member, Author) commented Nov 19, 2021

Re-pushed with some minor fixes per the discussion above and offline.

  • checkpointing now also removes namespaces that were checkpointed.
  • minor tweaks to some checkpoint logic
  • don't output an error when there is nothing to checkpoint (a logic error I had earlier)
  • update some descriptions

@garlick (Member) left a review:

This seems pretty close. Just one comment about location of source.

Comment on lines +866 to +869
static void get_rootref_cb (flux_future_t *fprev, void *arg)
{
int saved_errno;
garlick (Member):

Should this code for restoring namespaces be in the same source file as the code to save them?

chu11 (Member, Author):

I assume you meant "removing namespaces". Good point, perhaps it should be in job-exec.c because that's where we create/set up namespaces too. I'll move it in there.

@chu11 force-pushed the issue3811_kvscheckpoint branch from a6de4ab to c27a8ae on November 19, 2021 20:25
@chu11 (Member, Author) commented Nov 19, 2021

Re-pushed; the removal of recently checkpointed namespaces is now within job-exec.c. I spliced it out into its own commit within the series.

@garlick (Member) commented Nov 19, 2021

Well, my thought was that if checkpoint_running() in checkpoint.c writes the checkpoint object, that ns_get_rootref() which reads the checkpoint object should also be in checkpoint.c. That way the details of the checkpoint object wouldn't have to be exposed outside of that source file. Would that potentially work?

@chu11 (Member, Author) commented Nov 19, 2021

> Well, my thought was that if checkpoint_running() in checkpoint.c writes the checkpoint object, that ns_get_rootref() which reads the checkpoint object should also be in checkpoint.c. That way the details of the checkpoint object wouldn't have to be exposed outside of that source file. Would that potentially work?

Ohh sorry, I didn't know you were referring to that half of this PR. Yes, I think that would work and is a good idea too. Let me do that.

I do think that the "remove namespaces" part can stay in job-exec though, since it's independent of the checkpointing.

@garlick (Member) commented Nov 19, 2021

No my bad - I was not very clear about what code I was referring to!

@chu11 force-pushed the issue3811_kvscheckpoint branch from c27a8ae to e4b2967 on November 20, 2021 01:03
@chu11 (Member, Author) commented Nov 20, 2021

Re-pushed; I added

flux_future_t *checkpoint_get_rootrefs (flux_t *h);

char *checkpoint_find_rootref (flux_future_t *f,
                               flux_jobid_t id,
                               uint32_t owner);

as helper functions to get the checkpoint object and find a rootref within it. Because of all of the flux future compositions/continuations, I didn't want to move that code into the checkpoint lib.
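A hedged usage sketch of these helpers; the synchronous use of the returned future, the free() ownership of the returned string, and the wrapper's name are assumptions, not taken from the PR:

    #include <stdlib.h>
    #include <flux/core.h>
    #include "checkpoint.h"  /* checkpoint_get_rootrefs(), checkpoint_find_rootref() */

    /* Look up the checkpointed guest namespace rootref for job 'id'
     * owned by 'owner'; returns NULL if no checkpoint entry exists. */
    static char *lookup_job_rootref (flux_t *h, flux_jobid_t id, uint32_t owner)
    {
        flux_future_t *f;
        char *rootref;

        if (!(f = checkpoint_get_rootrefs (h)))
            return NULL;
        /* search the checkpoint object for this job's namespace rootref */
        rootref = checkpoint_find_rootref (f, id, owner);
        flux_future_destroy (f);
        return rootref;
    }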

@garlick (Member) commented Nov 20, 2021

Thanks @chu11 - this looks good. Let's get this merged so we can keep the momentum up!

Nice work!

chu11 added 13 commits November 19, 2021 21:46
Problem: The jobinfo_started() function took a fmt parameter and
variable args, but never used them.

Solution: Remove the variable argument options to jobinfo_started().
Adjust callers accordingly.

Problem: If the job-exec module is unloaded with running jobs, we have
no way to recreate the KVS namespaces for those jobs if we wish to
re-attach to them later.

Solution: Checkpoint the KVS root references for any KVS namespaces
of running jobs.

Problem: After checkpointing running namespaces, we do not want
running jobs to continue to write to those guest namespaces.

Solution: On job-exec unload, remove the namespaces we just
checkpointed.

Problem: When the job-manager is re-loaded and discovers a job that is
in the RUN state, there is no way to inform the job-exec module
that the job should still be running.

Solution: Add a "reattach" flag to the job-exec.start RPC.  This
flag informs the job-exec module that the job should still be
running.

Problem: The testexec exec implementation does not parse and handle
the "reattach" flag from the job-manager.

Solution: If the testexec implementation sees the "reattach" flag from
the job-manager, emulate that the job is still running by running it
for the remaining time it should run, given the job's start time.  Emit
a "re-starting" event indicating this restart and notify the
job-manager via a "reattach" event.

Problem: When re-attaching to an already running job, the guest KVS
namespace no longer exists.

Solution: Re-create the KVS namespace based on the previously
checkpointed root reference for the job.

Problem: Job "reattach" is difficult to test at the moment.

Solution: Add reattach start and finish events to aid in testing.

Problem: rc1 cancels all jobs when an instance exits, but that may
not be desirable all of the time, such as in some testing scenarios.

Solution: Support an environment variable FLUX_INSTANCE_RESTART to notify
rc1 that an instance restart is occurring and that it should not cancel
jobs upon instance shutdown.

Problem: Under test scenarios, it may be difficult to reattach
to a job that "already ended".

Solution: Support a flag that will assume a reattached job has
already finished and will go through the normal process for an
already completed job.

Add initial tests to verify that jobs can survive instance restarts
using the job-exec testexec execution plugin.
@codecov (bot) commented Nov 20, 2021

Codecov Report

Merging #3947 (61df254) into master (90cc9c8) will decrease coverage by 0.04%.
The diff coverage is 76.70%.

@@            Coverage Diff             @@
##           master    #3947      +/-   ##
==========================================
- Coverage   83.54%   83.49%   -0.05%     
==========================================
  Files         358      359       +1     
  Lines       52764    52987     +223     
==========================================
+ Hits        44080    44240     +160     
- Misses       8684     8747      +63     
Impacted Files                        Coverage Δ
src/modules/job-manager/restart.c     83.45% <71.42%> (-0.01%) ⬇️
src/modules/job-exec/checkpoint.c     72.38% <72.38%> (ø)
src/modules/job-exec/testexec.c       85.07% <74.54%> (-2.83%) ⬇️
src/modules/job-exec/job-exec.c       76.67% <83.56%> (+0.50%) ⬆️
src/modules/job-manager/start.c       73.33% <87.50%> (+0.52%) ⬆️
src/modules/job-exec/exec.c           79.28% <100.00%> (ø)
src/broker/overlay.c                  87.44% <0.00%> (-1.25%) ⬇️
src/broker/broker.c                   76.58% <0.00%> (-0.41%) ⬇️
src/cmd/flux-module.c                 80.83% <0.00%> (+0.34%) ⬆️
... and 3 more

@mergify (bot) merged commit 562d29b into flux-framework:master on Nov 20, 2021
@chu11 deleted the issue3811_kvscheckpoint branch on December 17, 2021 18:13