
Fix: sbd-cluster: periodically check corosync-daemon liveness #83

Merged

Conversation

wenningerk

using votequorum_getinfo.

This provides at least basic liveness observation of the corosync daemon without
many dependencies on the rest of the code.

As this is all done just on the local node, we can easily switch to using the
cpg protocol pacemaker is already using once we have a solution for an automatically
derived timeout setting, as being addressed in
#76

There is also still demand for simplified / more robust handling of all the
connections the cluster-watcher meanwhile has to corosync (cpg, cmap, votequorum).
These issues are addressed in #81 and #80.
Maybe this can be postponed until we have a way to tell whether a corosync
disconnection was done gracefully, e.g. triggered by systemd. In case of a
non-graceful disconnection we would assume that pacemaker has lost its corosync
connection as well, so that we should rather suicide as quickly as possible.
In case of a graceful disconnection we could wait for corosync to reappear again
without a timeout - as we are doing on startup.
Handling could be done in a similar way to what the pacemaker-watcher meanwhile
does, with differentiated behavior depending on whether pacemaker disconnected
with running resources or without.
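
For illustration, here is a minimal sketch of the kind of liveness poll this PR adds, using the corosync votequorum API. The fixed nodeid, the one-second interval and the bare main() are placeholders, not how sbd's cluster servant is actually structured; in sbd the check is driven by the mainloop and a surrounding timeout catches a stuck daemon.

#include <stdio.h>
#include <unistd.h>
#include <corosync/votequorum.h>

/* Return non-zero while the corosync daemon still answers votequorum queries.
 * If the daemon is stopped (e.g. SIGSTOPped), the call may block instead of
 * failing; in the real watcher the surrounding timeout covers that case. */
static int corosync_alive(votequorum_handle_t handle, unsigned int nodeid)
{
    struct votequorum_info info;

    return votequorum_getinfo(handle, nodeid, &info) == CS_OK;
}

int main(void)
{
    votequorum_handle_t handle;
    unsigned int local_nodeid = 1;   /* placeholder; fetch the real local nodeid elsewhere */

    if (votequorum_initialize(&handle, NULL) != CS_OK) {
        fprintf(stderr, "cannot connect to corosync votequorum\n");
        return 1;
    }
    for (;;) {
        if (!corosync_alive(handle, local_nodeid)) {
            fprintf(stderr, "corosync liveness check failed\n");
            break;    /* in sbd this would make the servant stop its own heartbeat */
        }
        sleep(1);     /* poll interval; sbd derives this from its timeouts */
    }
    votequorum_finalize(handle);
    return 0;
}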


@kgaillot kgaillot left a comment


Looks good to me, but is there a reason not to just focus on #76? Are you thinking of doing both checks?

configure.ac Outdated
CPPFLAGS="$CPPFLAGS $cmap_CFLAGS"
fi
if test $HAVE_votequorum = 0; then
AC_MSG_NOTICE(No package 'votequorum' found)


I would say "library" instead of "package", otherwise users may start looking for a distro package by that name

@wenningerk
Author

wenningerk commented Jun 7, 2019

There was a lot of discussion about how to derive the timeout in the cpg case, and the votequorum API is the only one that doesn't stall during the sync phase.
Eventually, using cpg for the check should be enough. And actually we would already need a proper mechanism
to verify the timeout (like pacemaker checking against the watchdog timeout), because if it is defined too
short we already risk too much delay in getting the quorum update, and with that we risk
split-brain.
So this is a compromise to have something that can be used now without much discussion, and that
for already existing configurations doesn't introduce additional reboots, even if those are using timeouts
that are actually too low to reliably prevent split-brain.
If I got it right, there are plans to have some support from the corosync side so that the timing issues can
be verified more easily. Until then we can at least cover the case where corosync isn't responsive at all.

@wenningerk wenningerk force-pushed the check_daemon_connection_quorum branch from 93208e4 to 0de1425 on Jun 11, 2019 06:09
@wenningerk wenningerk merged commit 398628b into ClusterLabs:master Jun 12, 2019
@Splarv

Splarv commented Jun 14, 2019

Okay, it looks like I am too late; I was busy with a different job. I checked your patch. I checked with timeouts inspired by @jfriesse:
#76 (comment)
but with sync_timeout=2000 (not important).

Maybe these timeouts are not necessary with this PR, but I like the reduced reaction time. With @jfriesse's timeouts I get reaction time = sync_timeout (2 s) + stonith_watchdog_timeout (10 s = 2 * sbd_watchdog_timeout (5 s)) + time to switch over -> roughly 15 s.
But without @jfriesse's timeouts the default sync_timeout is 30 s, so the reaction time would be about 28 s longer.

Three tests.

  1. Power off one node -> passed fine.
  2. killall -s STOP corosync -> passed fine, the failed node is watchdogged.
  3. Sometimes corosync can misbehave in a way where it continuously gets an error in its main loop and restarts the loop from the beginning. This can easily be simulated by ifdown eth0 (if there is only one network device in corosync.conf). But this is only one case of the problem; the same incorrect behaviour may be reachable through other errors. In such a case corosync looks like it is running, but all services, pacemaker for instance, are frozen.

So I checked with ifdown eth0. In that case I didn't get the desired effect. Yes, a master was brought up on another node, and from that node the failed node was marked as "offline". But the failed node was not watchdogged: on the failed node pacemaker is frozen, pcs mon shows that all is "OK" (the node is healthy in a totally healthy cluster of 2 nodes), and postgresql is still in master mode and accepts transactions on local connections.

Well, this PR is better than nothing. At least one more failure mode (killall -s STOP corosync) is now handled correctly.

@wenningerk
Author

@Splarv:

Thanks for testing.
I guess it can't be the job of sbd to catch all kinds of errors/misbehavior the watched services might exhibit.
Exaggerating a little, that would mean reimplementing all these services within sbd ... ;-)
Thus the approach is rather to have a heartbeat check towards one central component of each of these services, so that this central component can do further checking within its own process; if that component is somehow stuck or failing, the heartbeat to sbd will trigger.
This is what we are planning with pacemaker as well:
Currently we are just heartbeating to the cib process, as we are grabbing the node status from there.
But of course this is not enough. If you SIGSTOPped crmd on the DC, nobody would actually find out.
But it doesn't make sense to implement a heartbeat between sbd and each and every daemon within pacemaker. One process is monitored, and that one has to do the necessary further observation.
Of course none of this is trivial, given all the knowledge the services have about themselves. And it would be even less trivial for sbd, bloating it on top and making it harder to understand and thus more error-prone.
You can also see it as an extension of the concept sbd already has internally:
one simple process (the inquisitor) that watches the other watcher processes. Like this we try to build a hierarchy of watchers with the intention of keeping each one as simple as possible.
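
As a rough illustration of that hierarchy (a toy sketch, not sbd's actual IPC, timeout handling or process management): one parent keeps track of heartbeats from its child watcher processes over pipes and only treats the node as healthy while every child has reported recently.

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define N_SERVANTS 2
#define SERVANT_TIMEOUT 5   /* seconds without a heartbeat => unhealthy (illustrative) */

int main(void)
{
    int pipes[N_SERVANTS][2];
    time_t last_beat[N_SERVANTS];

    for (int i = 0; i < N_SERVANTS; i++) {
        if (pipe(pipes[i]) != 0) { perror("pipe"); return 1; }
        last_beat[i] = time(NULL);

        if (fork() == 0) {                           /* servant process */
            for (;;) {
                /* a real servant would do its own check here (cluster, disk, ...) */
                (void)write(pipes[i][1], "b", 1);    /* heartbeat to the inquisitor */
                sleep(1);
            }
        }
        fcntl(pipes[i][0], F_SETFL, O_NONBLOCK);     /* parent polls without blocking */
    }

    for (;;) {                                       /* inquisitor loop */
        char buf[16];
        int healthy = 1;

        for (int i = 0; i < N_SERVANTS; i++) {
            if (read(pipes[i][0], buf, sizeof(buf)) > 0)
                last_beat[i] = time(NULL);
            if (time(NULL) - last_beat[i] > SERVANT_TIMEOUT)
                healthy = 0;                         /* a servant stopped reporting */
        }
        if (healthy) {
            /* pet the hardware watchdog here; otherwise let it expire and reset the node */
        }
        sleep(1);
    }
}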

@Splarv

Splarv commented Jun 14, 2019

@wenningerk yep, pacemaker. But pacemaker is another topic. I was talking about corosync because the problem was in corosync, not in pacemaker. The health check of corosync in this PR is not enough to actually check its health. :)

But talking abstractly about pacemaker, maybe it is not necessary to check every process (pacemaker has a lot of them), but only one single result in which all processes are involved. For instance, the external stonith test is somewhat simple (it does not check every pacemaker process) but efficient at fencing a failed node.

@wenningerk
Author

@wenningerk yep, pacemaker. But pacemaker is another topic. I was talking about corosync because the problem was in corosync, not in pacemaker. The health check of corosync in this PR is not enough to actually check its health. :)

But talking abstractly about pacemaker, maybe it is not necessary to check every process (pacemaker has a lot of them), but only one single result in which all processes are involved. For instance, the external stonith test is somewhat simple (it does not check every pacemaker process) but efficient at fencing a failed node.

Just wanted to stress the recurring principle:
one heartbeat per service, basically. Everything else is up to the service.
And if that is not enough, we should try to fix it within the service (as e.g. done with the ifdown issue in corosync).
Exceptions are of course possible if it doesn't bind sbd too closely to the service and a change in the service is more complicated or not likely to happen for other reasons.
So if you have a suggestion for this very case ...

@Splarv

Splarv commented Jun 14, 2019

@wenningerk well, I may suggest another, simpler idea. Every process (corosync, all the pacemaker processes, sbd, pcs, etc.) at the end of its main loop just puts the current timestamp somewhere, for instance in cman. And a watchdog daemon (not necessarily sbd; another single-process program would be simpler) just checks the timestamps of all processes in its own main loop and raises an alarm if one of them is stale. And at the end of its main loop, if everything is OK, it sends an event to the watchdog/softdog. That would be much simpler than complex hierarchical checking.
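
A rough sketch of that idea, just to make it concrete: the paths, the list of watched processes and the 5-second staleness limit are all made up for illustration; each watched process would be expected to touch its own file at the end of its main loop.

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/stat.h>

/* illustrative heartbeat files, one per watched process */
static const char *watched[] = { "/run/liveness/corosync", "/run/liveness/pacemakerd" };

/* return non-zero only if every heartbeat file was touched within max_age seconds */
static int all_fresh(time_t max_age)
{
    struct stat st;
    time_t now = time(NULL);

    for (size_t i = 0; i < sizeof(watched) / sizeof(watched[0]); i++) {
        if (stat(watched[i], &st) != 0 || now - st.st_mtime > max_age)
            return 0;   /* stale or missing heartbeat */
    }
    return 1;
}

int main(void)
{
    int wd = open("/dev/watchdog", O_WRONLY);

    if (wd < 0) {
        perror("open /dev/watchdog");
        return 1;
    }
    for (;;) {
        if (all_fresh(5))
            (void)write(wd, "k", 1);   /* pet the watchdog only while everyone is alive */
        sleep(1);
    }
}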

as e.g. done with the ifdown issue in corosync

I don't agree. What I saw is that corosync now, in case of losing its single network interface, tries to use broadcasts instead. But the root of the problem is not losing a single network interface; it is that corosync, in case of a repeatable error in its main loop, just continues from the beginning. And so it is a repeated loop of the same error, with corosync only partially functional.

So if you have a suggestion for this very case ...

Yep, I have a suggestion for this very case too. I am trying to make a patch. %)

@wenningerk
Author

@wenningerk well, I may suggest another, simpler idea. Every process (corosync, all the pacemaker processes, sbd, pcs, etc.) at the end of its main loop just puts the current timestamp somewhere, for instance in cman. And a watchdog daemon (not necessarily sbd; another single-process program would be simpler) just checks the timestamps of all processes in its own main loop and raises an alarm if one of them is stale. And at the end of its main loop, if everything is OK, it sends an event to the watchdog/softdog. That would be much simpler than complex hierarchical checking.

Of course you can take the alternative approach where each and every service registers a software watchdog at a central instance, with some data like timeouts attached. The central instance is more or less agnostic of what it is watching. There are several approaches (one of them watchdogd, or systemd), but none really became the overall standard everybody wants to build against, nor would they meet all our requirements here AFAIK (when I last checked, e.g. systemd had a loop with a fixed number of iterations over all the timed-out items, so that a reliable reset would take ages ...).
And you have to take into account that all services have their startup/shutdown specialties.
sbd goes the other way and uses generic interfaces of the daemons.
Both have their pros and cons, and given what sbd needs/does beyond being a watchdog provider, the way sbd is going seems to be the easier one.
Hierarchical checking is simple, and the code moved into the services is useful beyond sbd scenarios as well.

as e.g. done with the ifdown issue in corosync

I don't agree. What I saw is that corosync now, in case of losing its single network interface, tries to use broadcasts instead. But the root of the problem is not losing a single network interface; it is that corosync, in case of a repeatable error in its main loop, just continues from the beginning. And so it is a repeated loop of the same error, with corosync only partially functional.

Didn't say it is perfect. Just said things beyond a single heartbeat are to be fixed in the service.

So if you have a suggestion for this very case ...

Yep, I have a suggestion for this very case too. I am trying to make a patch. %)

@Splarv

Splarv commented Jun 14, 2019

@wenningerk

Didn't say it is perfect. Just said things beyond a single heartbeat are to be fixed in the service.

The question is what may be considered a heartbeat. I think it should be something that demonstrates the whole health of a service, or almost the whole of it. :) At least checking that the main loop naturally comes to the end of each iteration, somehow, and was not broken or frozen in the middle.

Btw, when we want to check the liveness of a human, we check the beats of the heart muscle, not the tone of the little toe muscle on the left leg. ;)

@wenningerk
Author

when we want to check the liveness of a human, we check the beats of the heart muscle, not the tone of the little toe muscle on the left leg. ;)

If you want to stress the human analogy, I'd rather go for checking consciousness, so as to make sure the human is still able to assess his own health state ;-)

@Splarv

Splarv commented Jun 19, 2019

If you want to stress the human analogy, I'd rather go for checking consciousness, so as to make sure the human is still able to assess his own health state ;-)

Okay, in that case you must also make sure that the human is not only conscious, but also free of delirium or hallucinations.

I opened PR #85 to demonstrate what I am talking about. The test with a half-working corosync passes.
