Sbd checks activity of the totem #85

Closed

@Splarv Splarv commented Jun 19, 2019

To @wenningerk and @jfriesse:
This is not a complete, production-ready PR, because I don't fully understand the conditional compilation directives. For instance, CMAP was previously compiled in only for the "two node cluster" case (?), whereas now CMAP is used more widely. In a CentOS 7 environment (RedHat 7?), however, this PR compiles and works fine.
This PR is good enough to demonstrate what I talked about in PR #83. Sbd now acts as a watchdog for the server not only when corosync is frozen (for instance, by killall -s STOP corosync), but also when corosync is only half working (simulated, for instance, by ifdown eth0).

@wenningerk

This is definitely an interesting approach.
Unfortunately the api being used should be stalled during sync-phase as well as cpg-api (which is why we tossed the approach using cpg-api in favor of votequorum-api as when qdevice is being used we have to assume quite high timeouts that wouldn't be covered with a 5s default).
Atm I don't see a scenario where this totem-timestamp-advance approach shouldn't work otherwise.
But some things are not that obvious unfortunately ... (see the qdevice issue ;-) )
As said I'd personally prefer similar stuff to be done inside corosync triggered by an api that isn't stalled during sync-phase. Corosync would know when it is in sync-phase and what is to be done differently then. On the heartbeat-call corosync could on top verify that it e.g. isn't stuck in sync-phase. Internal knowledge of the clusterstack-components needed inside sbd should be kept to a minimum (not least important for maintainability - the timestamp you are using atm e.g. has moved with corosync-3).

@Splarv
Author

Splarv commented Jun 25, 2019

> Unfortunately the api being used should be stalled during sync-phase as well as cpg-api (which is why we tossed the approach using cpg-api in favor of votequorum-api as when qdevice is being used we have to assume quite high timeouts that wouldn't be covered with a 5s default).

IMHO, this is not a disadvantage but an advantage. :) An overly long sync phase signals a fatal problem, and the watchdog must fire. With an API that blocks during the sync phase, the watchdog can monitor and bound the length of the sync phase. With the votequorum API, by contrast, the sync phase can last forever and the watchdog will do nothing.
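
To illustrate the argument about a blocking API, here is a minimal Python sketch, not sbd's actual code (sbd is written in C). It assumes a hypothetical `blocking_check` callable standing in for any health query that stalls during the sync phase; the function name `check_with_deadline` and the timeout value are illustrative only. A check that does not return in time is treated as a failure, so the length of the stall is bounded by the timeout.

```python
import threading


def check_with_deadline(blocking_check, timeout_s):
    """Return True only if blocking_check() returns a truthy value in time.

    blocking_check is a hypothetical stand-in for a cluster health query
    that may stall (for example during a sync phase). A stalled, slow, or
    crashing check is reported as unhealthy.
    """
    done = threading.Event()
    result = []

    def worker():
        # If the check raises, `done` is never set and the caller
        # times out, which is also reported as unhealthy.
        result.append(bool(blocking_check()))
        done.set()

    # Daemon thread: a check that never returns must not keep the process alive.
    threading.Thread(target=worker, daemon=True).start()

    if not done.wait(timeout_s):
        return False  # the check stalled longer than the allowed time
    return result[0]
```

A caller would feed the watchdog only while `check_with_deadline(...)` returns True; choosing `timeout_s` is exactly the timeout question discussed in this thread.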

> Atm I don't see a scenario where this totem-timestamp-advance approach shouldn't work otherwise.

I planned to use totem timestamps, but in this PR I check incoming totem packets (?) or something like that; I am not sure. %) I just looked through the CMAP variables for one that advances often while corosync and totem work normally and stays constant when they work incorrectly.

> But some things are not that obvious unfortunately ... (see the qdevice issue ;-) )

Fine with correct timeouts.

> As said I'd personally prefer similar stuff to be done inside corosync triggered by an api that isn't stalled during sync-phase.

You are the boss, I accept this. We may have different opinions on how to fix this problem, but we obviously agree on what the final result must be. I hope you'll also add correct watchdog checking for pacemaker before RedHat 8 is released.

@Splarv Splarv closed this Jun 25, 2019
@wenningerk

Not trying to overrule you here or ignore your points.
I'm just trying to have in mind what it means to maintain an addition we make to sbd.
I agree with you that using the correct timeouts qdevice won't give us any issues regardless of which api we are using. It just turned out that it is a non-trivial issue to automatically derive them.
I never claimed that the votequorum-check now in is capable of detecting all possible issues corosync might have - but so ain't checking for totem-timestamp advancing (although better probably).
Aim is to just verify a part of corosync being functional that in turn should selfcheck as much as possible (well adapted to the current set of corosync-features and constraints) rather than doing it all from sbd.

@Splarv
Author

Splarv commented Jun 25, 2019

> I never claimed that the votequorum-check now in is capable of detecting all possible issues corosync might have

Yep, but combined with pacemaker checks this may not be bad, because the health of pacemaker depends heavily on the health of corosync.

> but so ain't checking for totem-timestamp advancing (although better probably).

I am not checking for totem-timestamp advancing either. ;) There is no totem timestamp in CMAP, and I don't know how to access the totem timestamp in corosync from sbd. I check the CMAP key runtime.totem.pg.mrp.srp.orf_token_rx.
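
The "counter must keep advancing" idea can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the code from this PR (which is C inside sbd). `read_counter` is a hypothetical callable that returns the current value of a counter such as runtime.totem.pg.mrp.srp.orf_token_rx; how that value is fetched from CMAP is deliberately left out. The class name `AdvanceMonitor` and the timeout are likewise made up for illustration.

```python
import time


class AdvanceMonitor:
    """Report healthy only while a counter keeps changing.

    read_counter: hypothetical callable returning the current counter value.
    timeout_s:    how long the counter may stay constant before the
                  monitored service is considered unhealthy.
    clock:        injectable time source, which makes the logic testable.
    """

    def __init__(self, read_counter, timeout_s, clock=time.monotonic):
        self._read = read_counter
        self._timeout = timeout_s
        self._clock = clock
        self._last_value = read_counter()
        self._last_change = clock()

    def healthy(self):
        value = self._read()
        now = self._clock()
        if value != self._last_value:
            # The counter advanced: remember when we last saw progress.
            self._last_value = value
            self._last_change = now
            return True
        # No progress: healthy only while still inside the grace period.
        return (now - self._last_change) < self._timeout
```

A frozen process and a half-working one look the same to this check: in both cases the counter stops changing and `healthy()` turns False once the timeout has elapsed.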

> Aim is to just verify a part of corosync being functional that in turn should selfcheck as much as possible (well adapted to the current set of corosync-features and constraints) rather than doing it all from sbd.

Well, if the real watchdog check of corosync ends up being a check of corosync's self-check mechanism, that will be perfect.
