Handling workflows after Flux restart #188
Comments
We no longer run the job epilog on elcap, as we've transitioned to use the housekeeping service. I wonder if the problem is that there's a race at startup and the |
The housekeeping service is a replacement for the administrative epilog, right? But not a replacement for jobtap epilog actions in general?
Hmmm, is that an expected race condition? If so, I could maybe work to mitigate it.
I don't think it is expected, but perhaps something we didn't think about. I haven't verified that's the case, BTW.
And now, re-reading, I see you were talking about the dws epilog, not the job manager/administrative epilog. I'll open an issue in flux-core.
A Kubernetes Workflow was stranded last night on elcap and required manual intervention to remove. I suspect it had something to do with the elcapi crash the same night.
The flux-coral2 service creates Workflow objects and is responsible for destroying them. However, the trigger to destroy them is an RPC that is sent in a job.state.cleanup jobtap plugin callback. The same callback adds an epilog, but I don't see the epilog in the eventlog:
@garlick or @grondo, do you know what happened last night on elcap, and under what cases the epilog wouldn't run?
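For anyone triaging a similar stranded Workflow, here is a rough sketch of the eventlog check described above: scan a job's eventlog for the epilog events the cleanup callback should have produced. It assumes the eventlog is available as JSON lines (one object per event with `name` and `context` keys) and that epilogs show up as `epilog-start` / `epilog-finish` events carrying a `description`. The description string `dws-epilog`, the helper names, and the sample eventlog are hypothetical and only for illustration, not taken from the real job.

```python
import json


def epilog_events(eventlog_text, description=None):
    """Collect epilog descriptions from a JSON-lines eventlog.

    Returns (started, finished): descriptions seen in 'epilog-start'
    and 'epilog-finish' events. If `description` is given, only
    events with that description are counted.
    """
    started, finished = [], []
    for line in eventlog_text.splitlines():
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        name = event.get("name")
        desc = event.get("context", {}).get("description")
        if description is not None and desc != description:
            continue
        if name == "epilog-start":
            started.append(desc)
        elif name == "epilog-finish":
            finished.append(desc)
    return started, finished


def epilog_status(eventlog_text, description):
    """Classify an epilog as 'missing', 'unfinished', or 'complete'."""
    started, finished = epilog_events(eventlog_text, description)
    if not started:
        return "missing"       # the callback never added the epilog
    if len(finished) < len(started):
        return "unfinished"    # epilog started but never completed
    return "complete"


# Hypothetical eventlog (made up for illustration): the job reaches
# 'clean' without any epilog events.
SAMPLE_NO_EPILOG = "\n".join([
    json.dumps({"timestamp": 1.0, "name": "submit", "context": {"userid": 1000}}),
    json.dumps({"timestamp": 2.0, "name": "finish", "context": {"status": 0}}),
    json.dumps({"timestamp": 3.0, "name": "clean"}),
])

# Hypothetical eventlog where the epilog started but never finished.
SAMPLE_UNFINISHED = "\n".join([
    json.dumps({"timestamp": 2.0, "name": "finish", "context": {"status": 0}}),
    json.dumps({"timestamp": 2.5, "name": "epilog-start",
                "context": {"description": "dws-epilog"}}),
])

if __name__ == "__main__":
    print(epilog_status(SAMPLE_NO_EPILOG, "dws-epilog"))
    print(epilog_status(SAMPLE_UNFINISHED, "dws-epilog"))
```

A "missing" result would match what is described here (no epilog in the eventlog at all), while "unfinished" would point at an epilog that started and was interrupted, which may call for a different fix.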