
Handling workflows after Flux restart #188

Open
jameshcorbett opened this issue Jul 26, 2024 · 4 comments

Comments

@jameshcorbett
Member

A Kubernetes Workflow was stranded on elcap last night and required manual intervention to remove. I suspect it had something to do with the elcapi crash last night.

The flux-coral2 service creates Workflow objects and is responsible for destroying them. However, the trigger to destroy them is an RPC sent from a job.state.cleanup jobtap plugin callback (sketched below the eventlog). The same callback adds an epilog, but I don't see the epilog in the eventlog:

1722005113.437951 prolog-finish description="dws-setup" status=0
1722005113.440012 exception type="exec" severity=0 userid=... note="failed to create guest ns: No such file or directory"
1722005113.440154 release ranks="all" final=true
1722005113.481615 free
1722005113.481647 clean
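
For context, the callback pattern I'm describing is roughly the following. This is only a sketch, not the actual flux-coral2 plugin source; the dws.teardown topic, its payload, and the dws-epilog name are made-up placeholders:

```c
/*
 * Sketch of the mechanism described above (illustrative only): a jobtap
 * callback on job.state.cleanup starts an epilog to hold the job, then
 * sends an RPC asking the companion service to destroy the Workflow,
 * finishing the epilog when the reply arrives.
 */
#include <stdlib.h>
#include <flux/core.h>
#include <flux/jobtap.h>

struct teardown_ctx {
    flux_plugin_t *p;
    flux_jobid_t id;
};

/* Reply handler: release the job from CLEANUP whether or not the
 * teardown succeeded; a nonzero status records the failure.
 */
static void teardown_cb (flux_future_t *f, void *arg)
{
    struct teardown_ctx *ctx = arg;
    int status = flux_rpc_get (f, NULL) < 0 ? 1 : 0;
    flux_jobtap_epilog_finish (ctx->p, ctx->id, "dws-epilog", status);
    flux_future_destroy (f);
    free (ctx);
}

static int cleanup_cb (flux_plugin_t *p,
                       const char *topic,
                       flux_plugin_arg_t *args,
                       void *arg)
{
    flux_t *h = flux_jobtap_get_flux (p);
    struct teardown_ctx *ctx;
    flux_future_t *f;

    if (!(ctx = calloc (1, sizeof (*ctx))))
        return -1;
    ctx->p = p;
    if (flux_plugin_arg_unpack (args, FLUX_PLUGIN_ARG_IN,
                                "{s:I}", "id", &ctx->id) < 0)
        goto error;

    /* Hold the job in CLEANUP until the Workflow is destroyed. */
    if (flux_jobtap_epilog_start (p, "dws-epilog") < 0)
        goto error;

    /* Illustrative RPC topic; the reply handler finishes the epilog. */
    if (!(f = flux_rpc_pack (h, "dws.teardown", FLUX_NODEID_ANY, 0,
                             "{s:I}", "jobid", ctx->id)))
        goto error_epilog;
    if (flux_future_then (f, -1., teardown_cb, ctx) < 0) {
        flux_future_destroy (f);
        goto error_epilog;
    }
    return 0;
error_epilog:
    flux_jobtap_epilog_finish (p, ctx->id, "dws-epilog", 1);
error:
    free (ctx);
    return -1;
}

int flux_plugin_init (flux_plugin_t *p)
{
    return flux_plugin_add_handler (p, "job.state.cleanup", cleanup_cb, NULL);
}
```

The point being that both the epilog and the Workflow teardown depend on this one callback firing, so if the plugin isn't loaded when the job reaches CLEANUP, neither happens.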

@garlick or @grondo, do you know what happened last night on elcap, or under what circumstances the epilog wouldn't run?

@grondo
Contributor

grondo commented Jul 26, 2024

We no longer run the job epilog on elcap, as we've transitioned to using the housekeeping service.
We know the job reached the cleanup state because the free and clean events are present.

I wonder if the problem is that there's a race at startup and the flux-coral2 jobtap plugin wasn't loaded at the time this job got the exception?

@jameshcorbett
Member Author

The housekeeping service is a replacement for the administrative epilog, right? But not a replacement for jobtap epilog actions in general?

> I wonder if the problem is that there's a race at startup and the flux-coral2 jobtap plugin wasn't loaded at the time this job got the exception?

Hmmm, is that an expected race condition? If so, I could maybe work to mitigate it.

@grondo
Contributor

grondo commented Jul 26, 2024

I don't think it is expected, but perhaps it's something we didn't think about. I haven't verified that's the case, BTW.

@grondo
Contributor

grondo commented Jul 26, 2024

And now, re-reading, I see you were talking about the dws epilog, not the job manager/administrative epilog.
That is more evidence that jobtap plugins were not loaded when this exception occurred.

I'll open an issue in flux-core.
