Support instance checkpoint/restart #3552

Closed
grondo opened this issue Mar 11, 2021 · 2 comments

Comments


grondo commented Mar 11, 2021

@mplegendre asked whether there is a way to submit a large set of jobs to a Flux instance and, if not all of the jobs can complete within the time limit of the original job, restart that instance in another allocation.

As described in this discussions comment by @garlick, we're close to enabling this feature. We should determine the following:

  1. Is there some way to get this functionality now with some scripting in the initial program of a user-level instance?
  2. What would be required for full functionality, i.e. something that could be used to automatically recover large instances that crash due to nodes going down, etc.?

garlick changed the title from "Support user-instance checkpoint/restart" to "Support instance checkpoint/restart" Apr 21, 2021

garlick commented Apr 21, 2021

More generally, we should think about a "defensive checkpoint" capability that could be enabled on both user and system instances for automatic and manual recovery.


grondo commented Mar 31, 2022

Fixed by #4208?

grondo closed this as completed Mar 31, 2022