kvs: support mechanism to checkpoint and restore guest namespaces #3811
Comments
Just curious: since the KVS is already writing out the final root reference on shutdown, why not have it also checkpoint the currently active namespaces, so they can be resurrected at startup? i.e. wouldn't it be simpler for the KVS to just recreate namespaces from a checkpoint than to have job-exec go through the process of linking incomplete namespaces into the main namespace, only to undo that on restart? (I'm guessing there is some underlying reason the proposed solution is better that I'm not seeing.) One benefit of taking care of this in the KVS is that namespaces created for any other purpose would be preserved, though admittedly the only use case now is for running jobs...
There are some complications if the KVS were to checkpoint guest namespaces the same way it does the primary, since guest namespaces are created and destroyed all the time with ephemeral names (e.g. avoiding resurrecting dead namespaces, associating a set of namespaces with a given defensive primary checkpoint, etc.). I think you're right though: this would properly be handled by the KVS and not foisted on the job-exec module if possible.

Maybe the way it could do it is to write a JSON object out as a KVS key in the primary namespace. That way you could just store a circular buffer of references to the primary namespace, and be able to restore any of them and get the set of namespaces active at that time. The KVS GC tool could just walk the latest checkpoint and not need to walk each namespace separately.

There is maybe a concern if the KVS restores from an older checkpoint and a running job is "discovered" that doesn't have a guest namespace yet. But maybe in that case the job-exec service could just kill those jobs and log a fatal exception, at least initially.

One other point for us to keep in mind is that namespaces have owners that would need to be restored; it's not just name vs. blobref.

Edit: just realized the above JSON object would need to be an RFC 11 tree object for that to work, and we don't have one of those that includes the namespace owner. So more thought is required there.
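A minimal sketch of what such a record might look like, assuming a key in the primary namespace holding a bounded history of {name, owner, rootref} entries; the field names, key layout, and blobref string below are invented for illustration and are not taken from RFC 11 or the KVS code:

```python
import json
import time
from collections import deque

# Hypothetical record pairing each guest namespace with its owner and root
# blobref; every key name here is made up purely to illustrate the idea.
def make_checkpoint(namespaces):
    return {
        "timestamp": time.time(),
        "namespaces": [
            {"name": name, "owner": owner, "rootref": rootref}
            for name, owner, rootref in namespaces
        ],
    }

# A bounded deque stands in for the "circular buffer": only the most recent
# few checkpoints are retained, and restoring any one of them yields the set
# of namespaces active at that time.
history = deque(maxlen=5)
history.append(make_checkpoint([("job-1234-guest", 5588, "sha1-0000")]))

# The serialized history is what would be stored under a key in the primary
# namespace (the key name itself would be up to the KVS).
print(json.dumps(list(history), indent=2))
```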
I was also thinking about this and realizing one benefit to the initial proposal (linking all active namespaces into the main namespace via job-exec) is that the main namespace from the checkpoint would be "usable" if doing some sort of post-mortem (e.g. loading the KVS in a single broker to grab data). Though a post-mortem might be most likely after a rank 0 crash and not a normal shutdown...
An alternative would be to add a "checkpoint object" to RFC 11 that lists all namespaces, owners, and a root reference. On request, and when stopping, the KVS could create such an object, write it out to the content cache, and save its blobref to the backing store. At startup, the KVS could get the most recent blobref from the backing store, retrieve the object, and create all namespaces with the saved state. With options, you could maybe tell flux to start from an earlier snapshot, or a named one.
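A toy sketch of that flow, using in-memory dicts in place of the content cache and backing store; the object layout, the fake blobref format, and the helper names are all assumptions made for illustration, not real flux-core interfaces:

```python
import json

content_cache = {}   # blobref -> blob (stand-in for the content cache)
backing_store = {}   # holds the blobref of the latest checkpoint object
namespaces = {}      # name -> (owner, rootref), stand-in for live namespaces

def store_blob(blob):
    # Fake blobref; real blobrefs are content hashes.
    blobref = f"fake-{len(content_cache)}"
    content_cache[blobref] = blob
    return blobref

def checkpoint(active):
    """On request or when stopping: write an object listing every namespace,
    its owner, and its root reference, then save its blobref."""
    obj = {"version": 1,
           "namespaces": [{"name": n, "owner": o, "rootref": r} for n, o, r in active]}
    backing_store["checkpoint"] = store_blob(json.dumps(obj))

def restore(blobref=None):
    """At startup: fetch a checkpoint object (latest by default, or an earlier
    or named one if requested) and recreate each namespace with saved state."""
    obj = json.loads(content_cache[blobref or backing_store["checkpoint"]])
    for ns in obj["namespaces"]:
        namespaces[ns["name"]] = (ns["owner"], ns["rootref"])

checkpoint([("job-1234-guest", 5588, "sha1-0000")])
restore()
print(namespaces)
```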
Was prototyping a dumb implementation per offline discussion. Just thought I'd put down some notes based on how things are going. Probably need to solve a few of these before this can be looked at more closely.
1A) Per offline discussion with @garlick, he suggested a hack in which we can write a treeobj to
2A) Or we could delete a namespace and recreate it, but namespaces are currently "garbage collected" in the background, so that may not be a good idea, or namespace removal needs to be rethought.
Side note on additional workarounds needed: how to calculate total runtime or other timing-based things that use timestamps from the eventlog. For example, if the job has completed by the time job-exec has "re-attached" to it, calculating "total runtime" via eventlog timestamps won't work. But at the same time, I don't know if the above case is possible, unless we grab info from systemd or something.
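To make the concern concrete, here is a small sketch of computing a runtime from eventlog timestamps; the event names and timestamps are made up, and the point is only that the calculation breaks if the earlier event was never observed after re-attaching:

```python
import json

# Made-up eventlog in the newline-delimited JSON style flux eventlogs use.
eventlog = """\
{"timestamp": 1633036801.5, "name": "start"}
{"timestamp": 1633036999.5, "name": "finish", "context": {"status": 0}}
"""

events = {e["name"]: e["timestamp"] for e in map(json.loads, eventlog.splitlines())}

# Raises KeyError if "start" is missing, e.g. if the job completed before
# job-exec re-attached and only later events are available.
runtime = events["finish"] - events["start"]
print(f"total runtime: {runtime:.1f}s")
```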
As I've been working on PR #3947, I was thinking about how we could do defensive checkpointing, updating only when a change we care about happens, like a change to the exec.eventlog.
pro: should be easy to maintain a hash of "running" jobs; can keep track of jobs that had changes to key fields and those that didn't, to limit KVS commit churn.
con: potentially a large directory with lots of fields; gotta manage add/remove of the fields. Large-directory issues could be mitigated if KVS large-directory support is done (#1206 / #1207).
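For illustration, a sketch of the kind of per-running-job bookkeeping described above; the field names and the commit condition are hypothetical, not job-exec's:

```python
# Hash of "running" jobs; field names here are invented for illustration.
running = {}

def on_exec_eventlog_update(jobid, seq, rootref):
    """Record state for a running job, but only report a change (which would
    trigger a KVS commit) when a field we care about actually changed."""
    entry = running.setdefault(jobid, {})
    if entry.get("eventlog-seq") == seq and entry.get("ns-rootref") == rootref:
        return False           # nothing changed, avoid KVS commit churn
    entry.update({"eventlog-seq": seq, "ns-rootref": rootref})
    return True                # caller would commit the updated entry

print(on_exec_eventlog_update("f1234abcd", 7, "sha1-0000"))   # True: new entry
print(on_exec_eventlog_update("f1234abcd", 7, "sha1-0000"))   # False: no change
```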
I think this should have been closed by #3947. We could open a separate issue on defensive checkpointing if need be. Possibly I'm forgetting some critical bit of work that still needs to be done here - reopen if so.
Problem: once the exec system is capable of preserving running jobs across an instance restart (#3801), KVS guest namespaces for those jobs will need to be preserved as well.
Note that the primary KVS namespace's final root reference is written out to the content backing store when flux is shut down. This may be extended to include defensive checkpoints of the primary namespace, as discussed in #3552.
A design should consider that we want the capability to garbage collect unreferenced blobs from the content backing store while the instance is shut down (#258). A hypothetical utility would create a new content backing store and copy over only the content blobs from the old backing store that are referenced from the most recent primary namespace checkpoint. Blobs referenced only by guest namespaces would be lost. This means that it is not helpful for, say, the rank 0 job shell for a job to track the most recent rootref returned from a guest namespace KVS commit, since what it points to might be garbage collected. (It might be helpful for detecting that data has been lost, however.)
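A toy version of such a utility, assuming a reachability walk from the latest primary checkpoint rootref; the blob format here (a JSON list of child blobrefs) is invented purely to keep the example short, whereas real KVS metadata is RFC 11 tree objects:

```python
import json

# Old backing store: blobref -> blob. "orphan" is referenced only by a guest
# namespace and so is not reachable from the primary checkpoint.
old_store = {
    "root": json.dumps(["a", "b"]),
    "a": json.dumps([]),
    "b": json.dumps(["c"]),
    "c": json.dumps([]),
    "orphan": json.dumps([]),
}

def gc_copy(store, rootref):
    """Copy only blobs reachable from rootref into a new store."""
    new_store, stack = {}, [rootref]
    while stack:
        ref = stack.pop()
        if ref in new_store:
            continue
        new_store[ref] = store[ref]
        stack.extend(json.loads(store[ref]))   # follow child references
    return new_store

print(sorted(gc_copy(old_store, "root")))   # ['a', 'b', 'c', 'root'] -- 'orphan' is lost
```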
The job-exec module is responsible for linking the final snapshot of the guest namespace into the primary namespace once the job completes. Maybe for an initial cut, the job-exec module could do the same for any running jobs during its shutdown, and then on startup, recreate the guest namespaces from the snapshots (after first discovering which jobs are supposed to still be running, method TBD).
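A very rough sketch of that initial cut, with in-memory dicts standing in for the KVS; the key path, the snapshot values, and the discovery step are all placeholders rather than actual job-exec behavior:

```python
primary = {}                                     # primary namespace: key -> snapshot treeobj
guests = {"job-1234-guest": "snapshot-1234"}     # live guest namespaces: name -> snapshot

def job_exec_shutdown():
    # Link the current snapshot of each still-running job's guest namespace
    # into the primary namespace, as is already done when a job completes.
    for name, snap in guests.items():
        primary[f"checkpoint.{name}"] = snap
    guests.clear()

def job_exec_startup(still_running):
    # After discovering which jobs are supposed to still be running (method
    # TBD), recreate each guest namespace from its saved snapshot.
    for name in still_running:
        guests[name] = primary.pop(f"checkpoint.{name}")

job_exec_shutdown()
job_exec_startup(["job-1234-guest"])
print(guests)
```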
The initial cut could then be extended in some way for defensive checkpointing, to allow for recovery from a rank 0 crash.
An alternative to defensive checkpointing of guest namespaces, which might have scalability ramifications, is to have the rank 0 shell "back up" any critical job data that has been written to the guest namespace and restore it upon request.