job-manager / job-exec: checkpoint and restore guest KVS namespaces #3947
Conversation
Haven't had a chance to review but, yay! Nice!
Yeah, me either, but nice work getting us moving forward on this @chu11! It seems like you are very close to "restarting" from an existing sqlite content backing store. Any idea what prevents that currently (i.e. why not restart an instance entirely from the same content store)? Excited to try this one out! (and double points for getting the testexec jobs to have the correct duration even after a reattach) 🙂
It was mostly the fact that jobs are killed in rc1/rc3. I think it's doable, but just haven't tried yet :-)
force-pushed from 2bd4ebb to 17a0396
just re-pushed with various cleanup so the code isn't quite as big a mess. Also verified that restarting with an existing content store works, and after restart the job is still listed. Edit: added some tests too
force-pushed from 569e1de to c079a0d
re-pushed, just adding some tests and some instrumentation to try and get more test coverage
force-pushed from c079a0d to 567fc3d
removing WIP as I think this has reached a point of being reasonably reviewable
Just to double check before digging in - if the namespace already exists, we don't restore a potentially older rootref on top of it do we?
force-pushed from 0cb4678 to 9249f51
No, if the namespace already exists you'll just get a normal EEXIST error upon the namespace create; I check for it and fall through to the rest of the normal setup code.
Although that is a good point that I should write a test for this.
just re-pushed, doing some cleanup and prefixing some of the eventlog events
force-pushed from 9249f51 to 1af7735
re-pushed with some minor cleanups, but most importantly added a test for when the guest KVS namespace is not destroyed during a "reattach".
I added a few comments, maybe not super helpful at this point. I'll make another pass in the morning when I have more than half a brain :-)
src/modules/job-exec/job-exec.c (Outdated)
@@ -1094,10 +1094,170 @@ static int configure_implementations (flux_t *h, int argc, char **argv)
     return 0;
 }

 static int unload_implementations (void)
 static flux_future_t *lookup_namespace_roots (struct job_exec_ctx *ctx)
Is there a way this code could be kept separate from job-exec.c with interfaces that would let it be reused more easily? Maybe it could be broken out to a separate file and take a zlistx_t of jobids from zhashx_keys() rather than the job_exec_ctx?
don't think I can abstract it at that level of detail b/c the job->running flag has to be looked at to determine which jobs are running. But it can definitely be abstracted more, perhaps just passing the jobs hash around.
src/modules/job-exec/job-exec.c (Outdated)
 if (flux_kvs_txn_put (txn,
                       0,
                       "job.checkpoint.running.namespaces",
I'd say this "belongs" to job-exec so put it under a job-exec. prefix? Probably no need for the long path either, maybe job-exec.kvs-namespaces?
src/modules/job-exec/job-exec.c
Outdated
if (job->running) { | ||
if (!fall) { | ||
if (!(fall = flux_future_wait_all_create ())) { | ||
flux_log_error (ctx->h, "flux_future_wait_all_create"); |
Error messages could be improved in this function.
I meant to say "this commit" rather than "this function". The error messages all look like they do not contain enough context to really run anything down.
src/modules/job-exec/job-exec.c
Outdated
json_t *nsdata = NULL; | ||
|
||
if (!(nsdata = json_array ())) { | ||
flux_log (ctx->h, LOG_ERR, "json_array"); |
or maybe "out of memory"?
src/modules/job-manager/restart.c
Outdated
if ((job->flags & FLUX_JOB_DEBUG)) { | ||
if (event_job_post_pack (ctx->event, | ||
job, | ||
"debug.flux-reattach", |
why is "flux" in the name? At the moment I'm not finding the event that signifies completion but if it's still there, prob should balance the names, like debug.exec-reattach-start and debug.exec-reattach-finish. (those are the event suffixes used by prolog/epilog events)
good point, i was just following the flux-restart event from earlier. start and finish seem better.
force-pushed from 5385406 to 8dbabef
re-pushed with fixes based on comments above. I had to squash everything since the fixes didn't quite flow right with just fixups.
force-pushed from 8dbabef to d0a26cd
Oops, started reviewing again but you are still updating. I'll just make these two comments and ask that you let me know when you're ready.
src/modules/job-exec/checkpoint.c (Outdated)
 /* Prototype flux job exec service
  *
  * DESCRIPTION
  *
  * This module implements the exec interface as described in
  * job-manager/start.c, but does not currently support execution of
Header cut & pasted from job-exec.c.
A paragraph or two describing what this does (and its prototype nature) would be good actually! (I got momentarily excited :-)
src/modules/job-exec/checkpoint.c
Outdated
if (!(fall = flux_future_wait_all_create ())) { | ||
flux_log_error (h, "lookup_nsroots: " | ||
"flux_future_wait_all_create"); | ||
goto cleanup; | ||
} |
Looks like all the error cases in this function are equally unlikely and uninteresting to distinguish. Suggest you simply log one message down in checkpoint_running() where the function is called, like "failed to initiate KVS requests for namespace checkpoint".
Was going to make the same comment for the function after that one. Maybe whole file needs error logs audited.
force-pushed from d0a26cd to 7743da7
Oops, sorry! Shortly after I pushed I realized I didn't remove that cut & pasted header stuff. So I tried to sneak in their removal. Updated the headers. Edit: crud, build is breaking, lemme fix
Sorry to give you dribs and drabs. Apparently Sue ordered dinner and it's here so I'll be offline for an hour or so. Couple more comments for ya.
src/modules/job-exec/checkpoint.c (Outdated)
 int rv = -1;

 if (!nsdata || json_array_size (nsdata) <= 0)
     return 0;
set errno
actually, i can just remove this check now, the logic is different than some initial work
Oh duh, that's not an error is it.
src/modules/job-exec/checkpoint.c
Outdated
if (flux_future_wait_for (fall, -1.) < 0) { | ||
flux_log_error (h, "kvs_checkpoint: flux_future_wait_for"); | ||
goto cleanup; | ||
} |
Is this required? If so, maybe combine error log with get_nsroots() since a failure here is also a failure to get ns root refs.
@@ -40,6 +40,7 @@ struct job {
     uint8_t free_pending:1; // free request sent to sched
In the description, could you indicate that a boolean flag was added to the start request payload? it's kind of vague with the current wording. Also, it's for any job that's in the RUN state per its replayed event log, not a "belief" per se (you're giving job manager too much credit!)
force-pushed from 95945d5 to a6de4ab
re-pushed with some minor fixes per discussion above and offline.
This seems pretty close. Just one comment about location of source.
 static void get_rootref_cb (flux_future_t *fprev, void *arg)
 {
     int saved_errno;
Should this code for restoring namespaces be in the same source file as the code to save them?
i assume you meant "removing namespaces". Good point, perhaps it should be in job-exec.c b/c that's where we create/setup namespaces too. I'll move it in there.
force-pushed from a6de4ab to c27a8ae
re-pushed, the removal of recently checkpointed namespaces is now within job-exec.c
Well, my thought was that if …
Ohh sorry, I didn't know you were referring to that half of this PR. Yes, I think that would work and is a good idea too. Let me do that. I do think that the "remove namespaces" can stay in job-exec though, since it's independent of the checkpointing.
No my bad - I was not very clear about what code I was referring to!
force-pushed from c27a8ae to e4b2967
re-pushed, I added helper functions to get the checkpoint object and find a rootref within it. B/c of all of the flux future compositions / continuations, I didn't want to move that code wholesale.
Thanks @chu11 - this looks good. Let's get this merged so we can keep the momentum up! Nice work!
Problem: The jobinfo_started() function took a fmt parameter and variable args, but never used them. Solution: Remove the variable arguments options to jobinfo_started(). Adjust callers accordingly.
Problem: If the job-exec module is unloaded with running jobs, we have no way to recreate the KVS namespaces for those jobs if we wish to re-attach to them later. Solution: Checkpoint the KVS root references for any KVS namespaces of running jobs.
Problem: After checkpoint of running namespaces, we do not want running jobs to continue to write to that guest namespace. Solution: On job-exec unload, remove the namespaces we just checkpointed.
Problem: When the job-manager is re-loaded and discovers a job that's in the RUN state, there is no way to inform the job exec module that the job should still be running. Solution: Add a "reattach" flag to the job-exec.start RPC. This flag informs the job-exec module that the job should still be running.
Problem: The testexec exec implementation does not parse and handle the "reattach" flag from the job-manager. Solution: If the testexec implementation sees the "reattach" flag from the job-manager, emulate that the job is still running by running it for the remaining time it should run, given the job's start time. Emit a "re-starting" event indicating this restart and notify the job-manager via a "reattach" event.
Problem: When re-attaching to an already running job, the job's guest KVS namespace no longer exists. Solution: Re-create the KVS namespace based on the previously checkpointed root reference for the job.
Problem: Job "reattach" is difficult to test at the moment. Solution: Add reattach start and finish events to aid in testing.
Problem: rc1 cancels all jobs when an instance is exited, but that may not be desirable all of the time, such as in some testing scenarios. Solution: Support an environment variable FLUX_INSTANCE_RESTART to notify rc1 that an instance restart is occurring and that it should not cancel jobs upon instance shutdown.
Problem: Under test scenarios, it may be difficult to reattach to a job that "already ended". Solution: Support a flag that will assume a reattached job has already finished and will go through the normal process for an already completed job.
Add initial tests to see that jobs can survive instance restarts using the job-exec testexec execution plugin.
force-pushed from e4b2967 to 61df254
Codecov Report
@@ Coverage Diff @@
## master #3947 +/- ##
==========================================
- Coverage 83.54% 83.49% -0.05%
==========================================
Files 358 359 +1
Lines 52764 52987 +223
==========================================
+ Hits 44080 44240 +160
- Misses 8684 8747 +63
Thought I'd throw up this WIP in case there are any high level thoughts / comments. This is a prototype set of changes for checkpointing KVS namespaces when a broker goes down and "re-attaching" to an already running job. What this PR does is:
A) upon unloading the job-exec module, checkpoint the rootref of the KVS namespaces of still running jobs
B) upon re-load, if the job is believed to still be running, send job-exec a "re-attach" flag from job-manager
C) upon re-load, have job-exec re-build the KVS namespace with the checkpointed rootref
D) upon "re-attach", testexec will simulate the remaining runtime of the job based on the starttime of the job (i.e. perhaps testexec thinks the job has run 90 seconds out of 100 seconds, so simulate a run for just 10 more seconds)
This currently works with tests like this:
the eventlogs for the above test look like this (we would need RFC changes to document any new events we put in here). Some of the guest event log stuff is just for debugging / info.
need to clean up code, try to make some code non-synchronous, and add more tests, but I think this is a good first step