Sbd checks activity of the totem #85
This is definitely an interesting approach.
IMHO, this is not a disadvantage but an advantage. :) A sync phase that lasts too long is a sign of a fatal problem, and the watchdog must fire. With an API that blocks during the sync phase, the watchdog can monitor and bound the length of the sync phase. With the votequorum API, the sync phase can last forever and the watchdog will do nothing.
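To make the point about a blocking API concrete, here is a toy model in Python. It is only a sketch under assumptions: `SoftWatchdog` and `run_health_loop` are hypothetical names, time is simulated, and nothing here is sbd or corosync code. The loop tickles the watchdog only after each health check returns, so a check that blocks for longer than the timeout (a sync phase that never ends, say) lets the timer expire.

```python
class SoftWatchdog:
    """Toy watchdog timer: expires unless tickled within `timeout` seconds."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout
        self.last_tickle = now

    def tickle(self, now):
        # Restart the countdown.
        self.last_tickle = now

    def expired(self, now):
        return now - self.last_tickle > self.timeout


def run_health_loop(watchdog, check_durations, start=0.0):
    """Simulate a loop that tickles only after each blocking check returns.

    `check_durations` lists how long each (hypothetical) health check blocks.
    Returns the index of the first check that outlasted the watchdog,
    or None if every check returned in time.
    """
    now = start
    for index, duration in enumerate(check_durations):
        now += duration  # the call blocks for this long
        if watchdog.expired(now):
            return index  # a real watchdog would have reset the node here
        watchdog.tickle(now)
    return None
```

With a 5 second timeout, `run_health_loop(SoftWatchdog(5.0), [1.0, 1.0, 30.0, 1.0])` reports the third check (index 2) as the one that outlasts the watchdog. A non-blocking check that always returns quickly would never reach that branch, however long the underlying sync phase lasts.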
I planned to use totem timestamps, but in this PR I check incoming totem packets (?) or something like that; I am not sure. %) I just went through the CMAP variables and looked for one that advances often while corosync and totem work normally and stays constant when they work incorrectly.
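The idea of watching a variable that should keep advancing can be sketched roughly as follows. This is an illustration in Python, not the code in this PR: `StallDetector` and `max_stall` are hypothetical names, the sampled value stands in for whatever CMAP variable is read, and timestamps are passed in explicitly so the logic can be tried without a running cluster.

```python
class StallDetector:
    """Report a monitored counter as dead once it stops changing for too long."""

    def __init__(self, max_stall, now=0.0):
        self.max_stall = max_stall  # seconds the value may stay constant
        self.last_value = None
        self.last_change = now

    def observe(self, value, now):
        """Record one sample; return True while the counter counts as alive."""
        if value != self.last_value:
            # The value advanced, so remember when that happened.
            self.last_value = value
            self.last_change = now
        return now - self.last_change <= self.max_stall
```

A caller would sample the variable periodically and tickle the watchdog only while `observe` returns True. Choosing `max_stall` is the delicate part: too short and a quiet but healthy cluster gets fenced, too long and a half-working corosync goes unnoticed.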
Fine with correct timeouts.
You are the boss, I accept this. We may have different opinions on how to fix this problem, but obviously we agree on what the final result must be. I hope you'll add correct watchdog checking for pacemaker too before RedHat 8 is released.
Not trying to overrule you here or ignore your points.
Yep, but with pacemaker checks this may not be bad, because the health of pacemaker depends heavily on the health of corosync.
I am not checking whether the totem timestamp advances either. ;) There is no totem timestamp in the CMAP, and I don't know how to access the totem timestamp in corosync from sbd. I check CMAP
Well, so if the real watchdog check of corosync turns out to be a check of the corosync self-check mechanism, that will be perfect.
To @wenningerk @jfriesse
This is not a complete PR for production, because I have some misunderstanding of the conditional compilation directives. For instance, previously CMAP was compiled in only in the case of the "two node cluster" directive (?), while now CMAP is used more widely. But in the CentOS 7 environment (RedHat 7?) this PR compiles and works fine.
This PR is good enough to demonstrate what I talked about in PR #83. Sbd now lets the watchdog fire not only when corosync is frozen (by `killall -s STOP corosync`, for instance), but also when corosync is only half working (simulated, for instance, by `ifdown eth0`).