Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rc3: exits immediately on command failure, resulting in many errors #2672

Closed
grondo opened this issue Jan 24, 2020 · 2 comments
Closed

rc3: exits immediately on command failure, resulting in many errors #2672

grondo opened this issue Jan 24, 2020 · 2 comments

Comments

@grondo
Copy link
Contributor

grondo commented Jan 24, 2020

Since rc3 is run with -e, it exits immediately on the first command failure. When flux-sched is installed along with flux-core, the qmanager-start rc1 script removes sched-simple in order to load the qmanager module, but this results in flux module remove sched-simple failing for every instance. Since rc3 then exits before unloading all other modules.

E.g. on fluke:

 grondo@fluke108:~$ flux start flux getattr size
1
2020-01-24T21:17:30.234811Z broker.err[0]: rc3: flux-module: cmb.rmmod sched-simple: No such file or directory
2020-01-24T21:17:30.235136Z broker.err[0]: Run level 3 Exited with non-zero status (rc=1) 0.0s
flux-broker: module 'cron' was not cleanly shutdown
flux-broker: module 'kvs-watch' was not cleanly shutdown
flux-broker: module 'job-exec' was not cleanly shutdown
flux-broker: module 'job-manager' was not cleanly shutdown
flux-broker: module 'userdb' was not cleanly shutdown
flux-broker: module 'job-ingest' was not cleanly shutdown
flux-broker: module 'aggregator' was not cleanly shutdown
flux-broker: module 'content-sqlite' was not cleanly shutdown
flux-broker: module 'job-info' was not cleanly shutdown
flux-broker: module 'kvs' was not cleanly shutdown
flux-broker: module 'barrier' was not cleanly shutdown

I can't recollect the rationale for running rc3 under bash -e. If there isn't a good reason, I propose we remove that flag. I think we want rc3 to try harder to run to completion.

This may be fixed by upcoming initialization changes. However, I would consider the current flux RPMs broken, so perhaps we should throw in a fix for this issue and generate some new packages.

grondo added a commit to grondo/flux-core that referenced this issue Jan 25, 2020
Problem: If any step in rc3 fails, the shell immediately exits,
bypassing the rest of the shutdown. At best this results in a slew
of errors about "module was not cleanly shutdown", but at worst
could result in data loss as persistence rules could be skipped.

Remove the -e option from the rc3 shebang line so that the rc3
script produces errors but attempts to run to completion.

Fixes flux-framework#2672
grondo added a commit to grondo/flux-core that referenced this issue Jan 25, 2020
Problem: If any step in rc3 fails, the shell immediately exits,
bypassing the rest of the shutdown. At best this results in a slew
of errors about "module was not cleanly shutdown", but at worst
could result in data loss as persistence rules could be skipped.

Remove the -e option from the rc3 shebang line so that the rc3
script produces errors but attempts to run to completion.

Fixes flux-framework#2672
@garlick
Copy link
Member

garlick commented Jan 26, 2020

Sorry to come into this late. This was my bad - the behavior of flux module remove changed in e42826e when mrpc was dropped. Before that, it reported errors on stderr but still exited with success. I stumbled across that and "fixed" it without thinking through these ramifications. Sorry about that.

The benefit of -e is to make the instance fail if something important like flushing the content cache to disk fails. Exiting immediately when anything goes wrong is not a great way to handle that, but ignoring all errors may be going too far. I'm fine with dropping the -e now and restoring some error handling as part of resolving #2650. Does that make sense?

@grondo
Copy link
Contributor Author

grondo commented Jan 26, 2020

Yes, that seems reasonable!

@mergify mergify bot closed this as completed in b7b694c Jan 26, 2020
@grondo grondo added this to the flux-core v0.15.0 milestone Jan 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants