ipcc: Have a few goes at tidying up after a dead server #434

chrissie-c · 2021-01-22T09:45:10Z

This is an attempt to make sure that /dev/shm is cleaned up when a
server exits unexpectedly. Normally it's the server's responsibility
to tidy up sockets, but if it crashes or is killed with SIGKILL then
the client (us) makes a reasonable attempt to tidy up the server sockets
we have connected. The extra delay here just gives the server a chance to
disappear fully. As a client we can get here pretty quickly, but shutting
down a large server may take a little longer, even when SIGKILLed.
The 1/100th of a second is an arbitrary delay (of course), but it seems to
catch most servers in two tries or fewer.

See https://bugzilla.redhat.com/show_bug.cgi?id=1614166 for more info.
And yes, I'm expecting this to be controversial and anyone with better ideas is welcome.

jfriesse

Seems to solve the problem with killing corosync (kill -9) and clients not cleaning up the files in /dev/shm, so ACK from me.

kgaillot · 2021-01-22T15:22:05Z

Looks good to me too.

BTW a separate issue for libqb is that usleep() is considered obsolete (no longer in the POSIX standard, which defines nanosleep()). I don't know of any platforms where it's a problem but it might crop up one day.
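As an illustration of the nanosleep() alternative, a usleep()-style wrapper could look like the sketch below. `sleep_usec()` is a hypothetical name, not an existing libqb function. Unlike a bare call, it resumes sleeping for the remaining time if a signal interrupts nanosleep() with EINTR.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Hypothetical usleep() replacement built on POSIX nanosleep().
 * If a signal interrupts the sleep, nanosleep() stores the unslept
 * time in `rem`, and we sleep again for that remainder. */
static int sleep_usec(unsigned long usec)
{
    struct timespec req = {
        .tv_sec = (time_t)(usec / 1000000UL),
        .tv_nsec = (long)((usec % 1000000UL) * 1000UL),
    };
    struct timespec rem;

    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR)
            return -1;  /* e.g. EINVAL; errno is left set for the caller */
        req = rem;      /* resume with the time that was left */
    }
    return 0;
}
```

Used as `sleep_usec(10000)`, this would stand in for a 1/100th-of-a-second `usleep(10000)`.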


wferi · 2021-01-25T11:36:16Z

If /dev/shm weren't hard-coded (and if I understand the issue right), this could be neatly solved by the RuntimeDirectory systemd directive used by all IPC server units. As things stand, ExecStopPost could do the required cleanup instead. I'm not sure it has all the required information (for example, the server PID) available, though...
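In the spirit of the ExecStopPost suggestion, such a command could run a small helper like the sketch below. It is only a sketch under stated assumptions: `remove_prefixed_files()` is a hypothetical name, and the idea that one server's leftover files share a known file-name prefix is an assumption, not libqb's documented naming scheme. As noted above, without the server PID a prefix alone may not be enough to tell one server's leftovers from another's.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical cleanup helper, e.g. for an ExecStopPost= command:
 * unlink every entry in `dir` whose name starts with `prefix`.
 * Directories such as "." and ".." cannot be unlink()ed, so they are
 * never counted. Returns the number of entries removed, or -1 if
 * `dir` cannot be opened. */
static int remove_prefixed_files(const char *dir, const char *prefix)
{
    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    size_t plen = strlen(prefix);
    int removed = 0;
    struct dirent *ent;
    char path[4096];

    while ((ent = readdir(d)) != NULL) {
        if (strncmp(ent->d_name, prefix, plen) != 0)
            continue;
        int n = snprintf(path, sizeof path, "%s/%s", dir, ent->d_name);
        if (n < 0 || (size_t)n >= sizeof path)
            continue;  /* skip names that would be truncated */
        if (unlink(path) == 0)
            removed++;
    }
    closedir(d);
    return removed;
}
```

For example, `remove_prefixed_files("/dev/shm", "qb-")` would remove every matching entry, which is exactly why the PID question matters: it cannot distinguish a dead server's files from a live one's.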

https://build.opensuse.org/request/show/924180 by user yan_gao + dimstar_suse (forwarded request 924179 from yan_gao): Update to version 2.0.3+20210303.404adbc (v2.0.3):
- syslog: Add a message-id parameter for messages (gh#ClusterLabs/libqb#433)
- timers: Add some locking (gh#ClusterLabs/libqb#436)
- ipcc: Have a few goes at tidying up after a dead server (gh#ClusterLabs/libqb#434)
- strlcpy: Check for maxlen underflow (gh#ClusterLabs/libqb#432)
- doxygen2man: fix printing of lines starting with '.' (gh#ClusterLabs/libqb#431)
- doxygen2man: ignore all-whitespace brief descriptions (gh#ClusterLabs/libqb#430)

chrissie-c force-pushed the try-harder-to-close-server-sockets branch from 4b43ed8 to c7d2833 on January 22, 2021 10:20

jfriesse approved these changes Jan 22, 2021


chrissie-c force-pushed the try-harder-to-close-server-sockets branch from c7d2833 to 088a8af on January 25, 2021 11:01

chrissie-c merged commit 991872e into ClusterLabs:master Jan 25, 2021
