crashing with running jobs can lead to problems #4862

Closed
garlick opened this issue Jan 17, 2023 · 1 comment


garlick commented Jan 17, 2023

Problem: if the flux rank 0 broker crashes with running jobs and then restarts, the fluxion scheduler may fail when processing the hello protocol.

2023-01-17T16:10:17.222950Z sched-fluxion-qmanager.err[0]: hello: error loading R for id=202046916802905088: No such file or directory

To get fluxion loaded, those jobs have to be removed from the KVS.

Until we can handle recovering running jobs, we should probably just force these jobs into inactive state.

Caveat: tasks belonging to the job could still be running.
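
A rough sketch of the interim workaround, to make the idea concrete. This is not flux-core or fluxion code: `Job`, `replay_running_jobs`, and the dict standing in for the KVS are all hypothetical names, and it assumes that "these jobs" means running jobs whose R cannot be loaded after the restart.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Toy stand-in for a job record; not a flux-core type."""
    id: int
    state: str = "RUN"
    note: str = ""


def replay_running_jobs(jobs, kvs):
    """Partition running jobs after a restart.

    `kvs` is a plain dict mapping job id -> R (a hypothetical layout).
    Jobs whose R is present are returned for replay to the scheduler;
    jobs whose R is missing are forced inactive instead of aborting
    the whole hello handshake.
    """
    recovered, forced_inactive = [], []
    for job in jobs:
        if job.state != "RUN":
            continue  # only jobs that were running at crash time matter
        if job.id in kvs:
            recovered.append(job)
        else:
            # R is gone: mark this one job inactive and keep going.
            job.state = "INACTIVE"
            job.note = "R not found after restart; tasks may still be running"
            forced_inactive.append(job)
    return recovered, forced_inactive


jobs = [Job(1), Job(2), Job(3, state="INACTIVE")]
recovered, forced = replay_running_jobs(jobs, {1: "R for job 1"})
print([j.id for j in recovered], [j.id for j in forced])
```

The point of the sketch is only that one unrecoverable job should not prevent the scheduler from loading. The note on the forced-inactive job records the caveat above, since nothing here checks for leftover tasks.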

Edit: the fluxion bug that is triggered here is flux-framework/flux-sched#991


garlick commented Jan 29, 2023

I think #4894 addresses the short-term issue here, so closing.
