
Handling workflows after Flux restart #188

Open
jameshcorbett opened this issue Jul 26, 2024 · 4 comments

Comments

@jameshcorbett
Member

A Kubernetes Workflow was stranded on elcap last night and required manual intervention to remove. I suspect it had something to do with the elcapi crash last night.

The flux-coral2 service creates Workflow objects and is responsible for destroying them. However, the trigger to destroy them is an RPC sent from a job.state.cleanup jobtap plugin callback (sketched below the eventlog). The same callback adds an epilog, but I don't see the epilog in the eventlog:

1722005113.437951 prolog-finish description="dws-setup" status=0
1722005113.440012 exception type="exec" severity=0 userid=... note="failed to create guest ns: No such file or directory"
1722005113.440154 release ranks="all" final=true
1722005113.481615 free
1722005113.481647 clean
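
For context, the callback pattern I'm describing is roughly the following. This is only a sketch, not the actual flux-coral2 plugin source; the dws.teardown topic, its payload, and the dws-epilog name are made-up placeholders:

```c
/*
 * Sketch of the mechanism described above (illustrative only): a jobtap
 * callback on job.state.cleanup starts an epilog to hold the job, then
 * sends an RPC asking the companion service to destroy the Workflow,
 * finishing the epilog when the reply arrives.
 */
#include <stdlib.h>
#include <flux/core.h>
#include <flux/jobtap.h>

struct teardown_ctx {
    flux_plugin_t *p;
    flux_jobid_t id;
};

/* Reply handler: release the job from CLEANUP whether or not the
 * teardown succeeded; a nonzero status records the failure.
 */
static void teardown_cb (flux_future_t *f, void *arg)
{
    struct teardown_ctx *ctx = arg;
    int status = flux_rpc_get (f, NULL) < 0 ? 1 : 0;
    flux_jobtap_epilog_finish (ctx->p, ctx->id, "dws-epilog", status);
    flux_future_destroy (f);
    free (ctx);
}

static int cleanup_cb (flux_plugin_t *p,
                       const char *topic,
                       flux_plugin_arg_t *args,
                       void *arg)
{
    flux_t *h = flux_jobtap_get_flux (p);
    struct teardown_ctx *ctx;
    flux_future_t *f;

    if (!(ctx = calloc (1, sizeof (*ctx))))
        return -1;
    ctx->p = p;
    if (flux_plugin_arg_unpack (args, FLUX_PLUGIN_ARG_IN,
                                "{s:I}", "id", &ctx->id) < 0)
        goto error;

    /* Hold the job in CLEANUP until the Workflow is destroyed. */
    if (flux_jobtap_epilog_start (p, "dws-epilog") < 0)
        goto error;

    /* Illustrative RPC topic; the reply handler finishes the epilog. */
    if (!(f = flux_rpc_pack (h, "dws.teardown", FLUX_NODEID_ANY, 0,
                             "{s:I}", "jobid", ctx->id)))
        goto error_epilog;
    if (flux_future_then (f, -1., teardown_cb, ctx) < 0) {
        flux_future_destroy (f);
        goto error_epilog;
    }
    return 0;
error_epilog:
    flux_jobtap_epilog_finish (p, ctx->id, "dws-epilog", 1);
error:
    free (ctx);
    return -1;
}

int flux_plugin_init (flux_plugin_t *p)
{
    return flux_plugin_add_handler (p, "job.state.cleanup", cleanup_cb, NULL);
}
```

The point being that both the epilog and the Workflow teardown depend on this one callback firing, so if the plugin isn't loaded when the job reaches CLEANUP, neither happens.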

@garlick or @grondo, do you know what happened last night on elcap, or under what circumstances the epilog wouldn't run?

@grondo
Contributor

grondo commented Jul 26, 2024

We no longer run the job epilog on elcap, as we've transitioned to using the housekeeping service.
We know the job reached the cleanup state because the free and clean events are present.

I wonder if the problem is that there's a race at startup and the flux-coral2 jobtap plugin wasn't loaded at the time this job got the exception?

@jameshcorbett
Member Author

The housekeeping service is a replacement for the administrative epilog, right? But not a replacement for jobtap epilog actions in general?

> I wonder if the problem is that there's a race at startup and the flux-coral2 jobtap plugin wasn't loaded at the time this job got the exception?

Hmmm, is that an expected race condition? If so, I could maybe work to mitigate it.

@grondo
Contributor

grondo commented Jul 26, 2024

I don't think it is expected, but perhaps it's something we didn't think about. I haven't verified that's the case, BTW.

@grondo
Contributor

grondo commented Jul 26, 2024

And now, re-reading, I see you were talking about the dws epilog, not the job manager/administrative epilog.
That is more evidence that jobtap plugins were not loaded when this exception occurred.

I'll open an issue in flux-core.
