Support instance checkpoint/restart #3552

grondo · 2021-03-11T03:06:45Z

@mplegendre had asked if there was a way to submit a large set of jobs to a Flux instance and restart that instance in another allocation if all the jobs were not able to complete within the time limit of the original job.

As described in this discussions comment by @garlick, we're close to enabling this feature. We should determine the following:

Is there some way to get this functionality now with some scripting in the initial program of a user-level instance?
What would be required for full functionality, i.e. something that could be used to automatically recover large instances that crash due to nodes going down, etc.?

garlick · 2021-04-21T17:07:16Z

We should generally think about a "defensive checkpoint" capability that could be enabled on both user and system instances for automatic and manual recovery.

grondo · 2022-03-31T03:33:56Z

fixed by #4208?

grondo mentioned this issue Mar 31, 2021

kvs: log date of restored checkpoint #3580

Closed

garlick changed the title ~~Support user-instance checkpoint/restart~~ Support instance checkpoint/restart Apr 21, 2021

garlick mentioned this issue Aug 3, 2021

kvs: support mechanism to checkpoint and restore guest namespaces #3811

Closed

grondo closed this as completed Mar 31, 2022

