Fix: sbd-cluster: periodically check corosync-daemon liveness #83
Conversation
Looks good to me, but is there reason not to just focus on #76? Are you thinking of doing both checks?
configure.ac (Outdated)
    CPPFLAGS="$CPPFLAGS $cmap_CFLAGS"
fi
if test $HAVE_votequorum = 0; then
    AC_MSG_NOTICE(No package 'votequorum' found)
I would say "library" instead of "package", otherwise users may start looking for a distro package by that name
There was a lot of discussion about how to derive the timeout in the cpg case, and the votequorum API is the only one that doesn't stall during the sync phase.
Force-pushed from 93208e4 to 0de1425
Okay, looks like I am too late; I was busy with a different job. I checked your patch. I also checked with timeouts inspired by @jfriesse: maybe those timeouts are not necessary with this PR, but I like the reduced reaction time. With @jfriesse's settings I get reaction time = sync_timeout (2s) + stonith_watchdog_timeout (10s = 2 * sbd_watchdog_timeout (5s)) + time to switch -> roughly 15s. Three tests.
So I checked by … Well, this PR is better than nothing. At least one more failure (
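The reaction-time arithmetic quoted above can be sketched as a tiny helper. This is purely illustrative: the function name, parameters, and the ~3s switch-time figure are assumptions for the example, not sbd code.

```c
#include <assert.h>

/* Illustrative sketch of the worst-case reaction-time arithmetic above:
 * reaction time = sync_timeout + stonith_watchdog_timeout + switch time,
 * where stonith_watchdog_timeout = 2 * sbd_watchdog_timeout.
 * Function and parameter names are hypothetical, not from sbd. */
static long reaction_time_s(long sync_timeout_s,
                            long sbd_watchdog_timeout_s,
                            long switch_time_s)
{
    long stonith_watchdog_timeout_s = 2 * sbd_watchdog_timeout_s;
    return sync_timeout_s + stonith_watchdog_timeout_s + switch_time_s;
}
```

With sync_timeout = 2s, sbd_watchdog_timeout = 5s and roughly 3s to switch, this gives the ~15s mentioned in the comment.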
Thanks for testing.
@wenningerk yep, pacemaker. But pacemaker is another topic. I mentioned corosync because the problem was in corosync, not in pacemaker. The health checks of corosync in this PR are not enough to check health. :) But speaking abstractly about pacemaker, it may not be necessary to check every process (pacemaker has a lot of them); maybe a single result in which all processes are involved would do. For instance, the external-stonith test case is somewhat simple (it does not check every pacemaker process) but efficient at fencing a failed node.
Just wanted to stress the reappearing principle.
@wenningerk well, I may suggest another, simpler idea. Every process (corosync, all pacemaker processes, sbd, pcs, etc.) at the end of its main loop just puts the current timestamp somewhere, for instance in cman. And a watchdog daemon (not necessarily sbd; another single-process program would be simpler) just checks, in its main loop, the timestamps of all processes and raises an alarm if one of them is stale. And at the end of its main loop, if all is OK, it sends an event to the watchdog/softdog. That would be much simpler than complex hierarchical checking.
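A minimal self-contained sketch of that flat timestamp scheme. Everything here is hypothetical: an in-memory table stands in for whatever shared store the processes would actually stamp, and all names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <time.h>

#define MAX_PROCS 8

/* One slot per watched process; a real implementation would put this
 * in shared memory or a cluster store instead of a static array. */
struct heartbeat {
    char   name[32];
    time_t last_seen;
};

static struct heartbeat hb_table[MAX_PROCS];
static int hb_count;

/* Each watched process calls this at the end of its main loop. */
static void heartbeat_touch(const char *name, time_t now)
{
    for (int i = 0; i < hb_count; i++) {
        if (strcmp(hb_table[i].name, name) == 0) {
            hb_table[i].last_seen = now;
            return;
        }
    }
    if (hb_count < MAX_PROCS) {
        strncpy(hb_table[hb_count].name, name,
                sizeof(hb_table[hb_count].name) - 1);
        hb_table[hb_count].last_seen = now;
        hb_count++;
    }
}

/* The watchdog daemon's main loop: only if every registered process has
 * stamped within 'timeout' seconds does it keep feeding the watchdog/softdog. */
static bool heartbeat_all_alive(time_t now, time_t timeout)
{
    for (int i = 0; i < hb_count; i++) {
        if (now - hb_table[i].last_seen > timeout)
            return false;   /* stale entry: raise alarm, stop feeding */
    }
    return true;
}
```

The appeal of the scheme is that the watchdog daemon stays agnostic of what each process does internally; the cost, as noted in the replies below, is that a timestamp only proves the loop ran, not that the service is healthy.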
I don't agree. What I saw is that corosync, on losing its single net interface, now tries to use broadcasts instead. But the root of the problem is not losing a single net interface; it is that corosync, on a repeatable error in the main loop, just continues from the beginning. So it is a repeating loop of the same error, with only partial corosync functionality.
Yep, I have a suggestion for this very case too. I am trying to make a patch. %)
Of course you can go the alternative approach where each and every service registers a software watchdog at a central instance, with some data like timeouts. The central instance is more or less agnostic of what it is watching. There are several approaches (watchdogd and systemd among them), but none has really become the overall standard everybody wants to build against, nor would they meet all our requirements here afaik (when I checked last time, e.g. systemd had a loop with a fixed number of iterations over all timeouted entries, so that a reliable reset would take ages ...).
Didn't say it is perfect. Just said things beyond a single heartbeat are to be fixed in the service.
The question is what may be considered a heartbeat. I think something that demonstrates the whole health of a service, or almost the whole. :) At least checking that the main loop naturally came to the end of an iteration, somehow, and was not broken or frozen in the middle. Btw, when we want to check the liveness of a human, we check the beats of the heart muscle, not the tone of the little-toe muscle on the left leg. ;)
If you want to stress the human analogy, I'd rather go for checking consciousness, so as to assure that the human is still able to assess his own health state ;-)
Okay, in this case you must also make sure that the human is not only conscious but also doesn't have delirium or hallucinations. I put up PR #85 to demonstrate what I am talking about. The test with half-working corosync passes.
using votequorum_getinfo.
To have at least basic liveness observation of the corosync daemon without
many dependencies on the rest of the code.
As this is all just done on the local node, we can easily switch to using the
cpg protocol pacemaker is already using once we have a solution for automatically
derived timeout settings, as being addressed in
#76
There is as well still demand for simplified / robustified handling of all the
connections the cluster-watcher meanwhile has to corosync (cpg, cmap, votequorum).
These issues are addressed in #81 and #80.
Maybe this can be postponed till we have a way to tell whether a corosync
disconnection was graceful, triggered by e.g. systemd. In case of a non-graceful
disconnection we would assume that pacemaker has lost its corosync connection as
well, so we should rather suicide as quickly as possible.
In case of a graceful disconnection we could wait, without a timeout, for
corosync to reappear, as we do on startup.
Handling could be done in a similar way to what the pacemaker-watcher meanwhile
does, via differentiated behavior depending on whether pacemaker disconnected
with running resources or without.
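The periodic check the commit message describes follows a simple pattern. Below is a self-contained sketch of that pattern only: `query_fn` stands in for the actual `votequorum_getinfo()` call on the local node (which needs libvotequorum), and the struct, names, and failure budget are illustrative assumptions, not sbd's implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of a periodic daemon-liveness check. query_fn stands in for the
 * real votequorum_getinfo() call; all names here are hypothetical. */
struct liveness {
    bool (*query_fn)(void);  /* returns true if the daemon answered */
    int  failures;           /* consecutive failed queries so far */
    int  max_failures;       /* budget, derived from the watchdog timeout */
};

/* Run once per timer tick; returns false once the daemon must be
 * considered dead (i.e. stop feeding the watchdog / self-fence). */
static bool liveness_tick(struct liveness *lv)
{
    if (lv->query_fn()) {
        lv->failures = 0;    /* healthy again: reset the failure count */
        return true;
    }
    if (++lv->failures >= lv->max_failures)
        return false;        /* budget exhausted: treat corosync as dead */
    return true;             /* tolerate a transient failure */
}

/* Example stand-ins for a responsive and an unresponsive daemon. */
static bool query_up(void)   { return true;  }
static bool query_down(void) { return false; }
```

Keeping a small failure budget instead of reacting to the first failed query is one way to tolerate transient stalls (e.g. during a sync phase) while still bounding the overall reaction time.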