@mplegendre asked whether there is a way to submit a large set of jobs to a Flux instance and restart that instance in another allocation if not all of the jobs are able to complete within the time limit of the original job.
As described in this discussions comment by @garlick, we're close to enabling this feature. We should determine the following:
Is there some way to get this functionality now with some scripting in the initial program of a user-level instance?
What would be required for full functionality, i.e. something that could be used to automatically recover large instances that crash due to nodes going down, etc.?
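As a rough illustration of the first question only, an initial program could watch the remaining walltime and decide which queued jobs to start now and which to hold back for a follow-on allocation. The sketch below models just that decision in plain Python. It calls no Flux API, and `Job`, `plan_shutdown`, `est_runtime`, and `safety_margin` are all hypothetical names introduced for this example. It also assumes jobs run one after another, which a real instance would not.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Job:
    # Hypothetical record for one queued job; not a Flux data structure.
    name: str
    est_runtime: float  # estimated runtime in seconds


def plan_shutdown(jobs: List[Job], time_left: float,
                  safety_margin: float) -> Tuple[List[Job], List[Job]]:
    """Split jobs into those to run now and those to carry over.

    Assumes jobs run sequentially in the given order. safety_margin is
    the time reserved at the end of the allocation for saving state.
    """
    budget = time_left - safety_margin
    run_now: List[Job] = []
    carry_over: List[Job] = []
    elapsed = 0.0
    for job in jobs:
        if elapsed + job.est_runtime <= budget:
            # Job is expected to finish before the reserved margin begins.
            run_now.append(job)
            elapsed += job.est_runtime
        else:
            # Job would overrun; defer it to the next allocation.
            # Later, shorter jobs may still fit and are checked in turn.
            carry_over.append(job)
    return run_now, carry_over


if __name__ == "__main__":
    queue = [Job("a", 100.0), Job("b", 300.0), Job("c", 50.0)]
    run_now, carry_over = plan_shutdown(queue, time_left=400.0,
                                        safety_margin=60.0)
    print("run now:", [j.name for j in run_now])
    print("carry over:", [j.name for j in carry_over])
```

The carried-over list is what would need to be saved and resubmitted in the next allocation. How that state is actually saved and restored is the open question in this issue.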
More generally, we should think about a "defensive checkpoint" capability that could be enabled on both user and system instances for automatic and manual recovery.