kvs: support mechanism to checkpoint and restore guest namespaces #3811

Closed
garlick opened this issue Aug 3, 2021 · 8 comments
Labels: design (don't expect this to ever be closed...)

Comments

garlick commented Aug 3, 2021

Problem: once the exec system is capable of preserving running jobs across an instance restart (#3801), KVS guest namespaces for those jobs will need to be preserved as well.

Note that the primary KVS namespace's final root reference is written out to the content backing store when flux is shut down. This may be extended to include defensive checkpoints of the primary namespace, as discussed in #3552.

A design should consider that we want the capability to garbage collect unreferenced blobs from the content backing store while the instance is shut down (#258). A hypothetical utility would create a new content backing store and copy over only the content blobs from the old backing store that are referenced from the most recent primary namespace checkpoint. Blobs referenced only by guest namespaces would be lost. This means that it is not helpful for, say, the rank 0 job shell to track the most recent rootref returned from a guest namespace KVS commit, since what it points to might be garbage collected. (It might be helpful for detecting that data has been lost, however.)
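
For illustration only, here is a rough Python sketch of that hypothetical copy utility; `load_blob` and `store_blob` are invented stand-ins for whatever access the old and new backing stores would actually provide, not existing APIs:

```python
import json

def refs_in(treeobj):
    """Yield blobrefs referenced directly by an RFC 11 tree object."""
    t = treeobj.get("type")
    if t in ("dirref", "valref"):
        yield from treeobj["data"]
    elif t == "dir":
        for entry in treeobj["data"].values():
            yield from refs_in(entry)

def copy_reachable(rootref, load_blob, store_blob):
    """Copy every blob reachable from rootref into the new backing store."""
    seen, stack = set(), [rootref]
    while stack:
        blobref = stack.pop()
        if blobref in seen:
            continue
        seen.add(blobref)
        data = load_blob(blobref)       # read blob from the old store
        store_blob(blobref, data)       # write it to the new store
        try:
            treeobj = json.loads(data)  # dir/dirref blobs contain RFC 11 objects
        except ValueError:
            continue                    # raw value blob: nothing more to follow
        if isinstance(treeobj, dict):
            stack.extend(refs_in(treeobj))
```

Anything not reachable from the checkpointed primary rootref (including data referenced only by guest namespaces) would simply never be copied.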

The job-exec module is responsible for linking the final snapshot of the guest namespace into the primary namespace once the job completes. Maybe for an initial cut, the job-exec module could do the same for any running jobs during its shutdown, and then on startup, recreate the guest namespaces from the snapshots (after first discovering which jobs are supposed to still be running, method TBD).

The initial cut could then be extended with some form of defensive checkpointing, to allow for recovery from a rank 0 crash.

An alternative to defensive checkpointing of guest namespaces, which might have scalability ramifications, is to have the rank 0 shell "back up" any critical job data that has been written to the guest namespace and restore it upon request.

garlick added the design ("don't expect this to ever be closed...") label Aug 3, 2021
grondo commented Aug 3, 2021

Just curious: since the kvs is already writing out the final root reference on shutdown, why not have it also checkpoint currently active namespaces so they can be resurrected at startup? i.e. wouldn't it be simpler for the kvs to recreate namespaces from a checkpoint than to have job-exec go through the process of linking incomplete namespaces into the main namespace, only to undo that on restart? (I'm guessing there is some underlying reason the proposed solution is better that I'm not seeing.)

One benefit of taking care of this in the kvs is that namespaces created for any other purpose would be preserved, though admittedly the only use case now is for running jobs...

garlick commented Aug 3, 2021

There are some complications if the KVS were to checkpoint guest namespaces the same way it does the primary, since guest namespaces are created and destroyed all the time with ephemeral names (e.g. avoiding the resurrection of dead namespaces, associating a set of namespaces with a given defensive primary checkpoint, etc.). I think you're right though: this would properly be handled by the KVS and not foisted on the job-exec module, if possible.

Maybe the way it could do it is to write a json object out as a KVS key in the primary namespace. That way you could just store a circular buffer of references to the primary namespace, and be able to restore any of them and get the set of namespaces active at that time. The KVS GC tool could just walk the latest checkpoint and not need to walk each namespace separately.

There is maybe a concern if the KVS restores from an older checkpoint and a running job is "discovered" that doesn't have a guest namespace yet. But maybe in that case the job exec service could just kill those jobs and log a fatal exception, at least initially.

One other point for us to keep in mind is that namespaces have owners that would need to be restored. It's not just name vs blobref.

Edit: just realized the above json object would need to be an RFC 11 tree object for that to work, and we don't have one of those that includes the namespace owner. So more thought required there.
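
For context, an RFC 11 dirref tree object carries only blobref(s), with no field for an owner, roughly:

```json
{ "ver": 1, "type": "dirref", "data": [ "sha1-..." ] }
```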

grondo commented Aug 3, 2021

I was also thinking about this and realizing one benefit to the initial proposal (linking all active namespaces into the main namespace via job-exec) is that the main namespace from the checkpoint would be "usable" if doing some sort of post-mortem (e.g. loading the kvs in a single broker to grab data). Though a post-mortem might be most likely after a rank 0 crash and not a normal shutdown...

garlick commented Aug 3, 2021

An alternative would be to add a "checkpoint object" to RFC 11 that lists all namespaces, owners, and a root reference. On request, and when stopping, the KVS could create such an object and write it out to the content cache, then save its blobref to the backing store. At startup, the KVS could get the most recent blobref from the backing store, retrieve the object, and create all namespaces with the saved state. With options, you could maybe tell flux to start from an earlier snapshot, or a named one.
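
Purely as a sketch (no such object exists in RFC 11 today, and the field names and namespace names below are invented for illustration), the checkpoint object might look something like:

```json
{
  "ver": 1,
  "type": "checkpoint",
  "data": {
    "primary":     { "owner": 0,    "rootref": "sha1-..." },
    "job-<jobid>": { "owner": 1234, "rootref": "sha1-..." }
  }
}
```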

chu11 commented Nov 1, 2021

Was prototyping a dumb implementation per offline discussion. Just thought I'd put down some notes based on how things are going. Probably need to solve a few of these before this can be looked at more closely.

1. There is presently no way for a KVS namespace to be created and "initialized" at a specific root. This probably needs to be dealt with in the namespace create operation.

1A) Per offline discussion with @garlick, he suggested a hack in which we write a treeobj to `.` with the root we want, but that is currently not allowed by the KVS (i.e. writing to `.` is a special case that is presently rejected).

2. It is possible (e.g. if job-exec is unloaded but the KVS module is not) that the KVS namespace for a job will not be removed, so the ability to "reset" a namespace root to a new root may be needed?

2A) OR we could delete the namespace and recreate it, but namespaces are currently "garbage collected" in the background, so that may not be a good idea, or namespace removal needs to be rethought.

chu11 commented Nov 9, 2021

Side note on additional workarounds needed: how to calculate total runtime or other timing-based things that use timestamps from the eventlog. For example, if a job has completed by the time job-exec has "re-attached" to it, calculating "total runtime" via the eventlog timestamps won't work.

But at the same time, dunno if the above case is possible. Unless we grab info from systemd or something.

chu11 commented Nov 15, 2021

As I've been working on PR #3947, was thinking about how we could do defensive checkpointing.

1. Perhaps we could have a directory of checkpoint data, hypothetically something like:

   job.checkpoints.running.<jobid>.owner = 1234
   job.checkpoints.running.<jobid>.rootref = sha1-123456789abcdef

   and update it only when a change we care about happens, like a change to the exec.eventlog. Or perhaps checkpoint based on a timer if we care to do that. (Edit: or an RPC on user request)

   • pro: should be easy to maintain a hash of "running" jobs. Can keep track of jobs that had changes to key fields and those that didn't, to limit KVS commit churn.

   • con: potentially a large directory with lots of fields; gotta manage adding/removing the fields. Large directory issues could be mitigated if KVS large directory support is done (#1206 / #1207).

2. Store all checkpoint data within a single json object and update it as needed, defensively checkpointing when important changes we care about happen (a sketch of what such an object might look like follows this list).

   • pro: only change 1 KVS key vs adding/removing/changing tons.

   • con: the json object could get large, and regularly committing it can get slow over time; gotta serialize the json object, which could be expensive.
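
To make option 2 concrete, here is a purely illustrative sketch of such a json object; the field names and values are made up and don't correspond to any existing format:

```json
{
  "version": 1,
  "running": {
    "<jobid1>": { "owner": 1234, "rootref": "sha1-123456789abcdef" },
    "<jobid2>": { "owner": 1235, "rootref": "sha1-..." }
  }
}
```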

garlick commented Feb 24, 2022

I think this should have been closed by #3947.

We could open a separate issue on defensive checkpointing if need be.

Possibly I'm forgetting that some critical bit of important work still needs to be done here - reopen if so.

garlick closed this as completed Feb 24, 2022