ipcc: Have a few goes at tidying up after a dead server #434

Merged


chrissie-c
Contributor

This is an attempt to make sure that /dev/shm is cleaned up when a
server exits unexpectedly. Normally it's the server's responsibility
to tidy up sockets, but if it crashes or is killed with SIGKILL then
the client (us) makes a reasonable attempt to tidy up the server sockets
we have connected to. The extra delay here just gives the server a chance to
disappear fully. As a client we can get here pretty quickly, but shutting
down a large server may take a little longer, even when SIGKILLed.
The 1/100th of a second is an arbitrary delay (of course) but seems to
catch most servers in two tries or fewer.

See https://bugzilla.redhat.com/show_bug.cgi?id=1614166 for more info.
And yes, I'm expecting this to be controversial and anyone with better ideas is welcome.
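
The "wait a little, then try again" idea described above can be sketched as follows. This is only an illustration, not the merged patch: the helper name `wait_for_pid_gone`, its parameters, and the use of `kill(pid, 0)` to test whether the server still exists are all assumptions, since the thread does not say how the client detects that the server has gone. A 1/100th of a second delay corresponds to `delay_ns = 10000000L`.

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <time.h>

/*
 * Hypothetical helper, not part of the libqb API.
 * Polls up to max_tries times, sleeping delay_ns nanoseconds between
 * tries, until no process with the given pid exists.
 * delay_ns must be less than one second (1000000000).
 * Returns true once the process is gone, false if it is still there.
 */
static bool wait_for_pid_gone(pid_t pid, int max_tries, long delay_ns)
{
    struct timespec delay = { .tv_sec = 0, .tv_nsec = delay_ns };

    for (int i = 0; i < max_tries; i++) {
        /* Signal 0 sends nothing; it only checks existence and permission. */
        if (kill(pid, 0) == -1 && errno == ESRCH) {
            return true;
        }
        nanosleep(&delay, NULL);
    }
    return false;
}
```

A client could call something like `wait_for_pid_gone(server_pid, 10, 10000000L)` before removing the leftover files, where `server_pid` and the try count are again placeholders. Note that a PID check of this kind can be fooled if the PID is reused by an unrelated process, and an exited process that has not been reaped by its parent still counts as existing.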

@chrissie-c chrissie-c force-pushed the try-harder-to-close-server-sockets branch from 4b43ed8 to c7d2833 Compare January 22, 2021 10:20
Member

@jfriesse jfriesse left a comment

Seems to solve the problem with killing corosync (kill -9) and clients not cleaning the files in /dev/shm, so ACK from me.

@kgaillot
Contributor

Looks good to me too.

BTW a separate issue for libqb is that usleep() is considered obsolete (it was removed in POSIX.1-2008, which specifies nanosleep() instead). I don't know of any platforms where it's a problem, but it might crop up one day.
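
For reference, a microsecond sleep built on nanosleep() might look like the sketch below. The name `sleep_usec` is hypothetical and nothing like it is claimed to exist in libqb. Unlike usleep(), this version resumes the sleep after a signal interrupts it, which may or may not be what a given caller wants.

```c
#include <errno.h>
#include <time.h>

/*
 * Hypothetical replacement for usleep(), not part of libqb.
 * Sleeps for at least usec microseconds, restarting after EINTR.
 * Returns 0 on success, -1 on any other nanosleep() error.
 */
static int sleep_usec(unsigned long usec)
{
    struct timespec req = {
        .tv_sec = (time_t)(usec / 1000000UL),
        .tv_nsec = (long)(usec % 1000000UL) * 1000L,
    };
    struct timespec rem;

    /* When interrupted, nanosleep() stores the unslept time in rem. */
    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR) {
            return -1;
        }
        req = rem;
    }
    return 0;
}
```

With this helper, the 1/100th of a second delay from the patch description would be written as `sleep_usec(10000)`.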

@chrissie-c chrissie-c force-pushed the try-harder-to-close-server-sockets branch from c7d2833 to 088a8af Compare January 25, 2021 11:01
@wferi
Contributor

wferi commented Jan 25, 2021

If /dev/shm wasn't hard-coded (and if I understand the issue right), this could be neatly solved by the RuntimeDirectory systemd directive used by all IPC server units. As things stand, ExecStopPost could do the required cleanup instead. I'm not sure it has all required information (for example server PID) available, though...

@chrissie-c chrissie-c merged commit 991872e into ClusterLabs:master Jan 25, 2021
bmwiedemann pushed a commit to bmwiedemann/openSUSE that referenced this pull request Oct 11, 2021
https://build.opensuse.org/request/show/924180
by user yan_gao + dimstar_suse
- Update to version 2.0.3+20210303.404adbc (v2.0.3):
- syslog: Add a message-id parameter for messages (gh#ClusterLabs/libqb#433)
- timers: Add some locking (gh#ClusterLabs/libqb#436)
- ipcc: Have a few goes at tidying up after a dead server (gh#ClusterLabs/libqb#434)
- strlcpy: Check for maxlen underflow (gh#ClusterLabs/libqb#432)
- doxygen2man: fix printing of lines starting with '.' (gh#ClusterLabs/libqb#431)
- doxygen2man: ignore all-whitespace brief descriptions (gh#ClusterLabs/libqb#430) (forwarded request 924179 from yan_gao)