This repository has been archived by the owner on May 6, 2021. It is now read-only.

Reduce krakenFailThreshold? #75

Open
tserong opened this issue Dec 9, 2014 · 3 comments

@tserong
Contributor

tserong commented Dec 9, 2014

Currently, if there haven't been any heartbeats for 15 minutes, the UI shows "Cluster Updates Are Stale. The Cluster isn't updating Calamari. Please contact Administrator". Can we lower this threshold to, say, five minutes, or maybe even two minutes (given the heartbeat interval is one minute), to give the user quicker notice of failure?

The reason I'm asking is, if the entire cluster is dead (or, at least, if there aren't enough mons alive for the cluster to be quorate, thus providing no heartbeat information), ISTM it would be better if this failure were apparent sooner rather than later.

@ChristinaMeno
Contributor

@tserong I think we can probably tighten this up. cthulhu schedules heartbeats to run every 10s, so the lower bound for a new heartbeat is 10s + processing time.
In the GUI this window depends on the difference between cluster_update_time_unix and Date.now(), both expressed as ms since epoch.
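
As a rough illustration of the comparison described above, here is a minimal sketch. It is not the actual calamari-clients code: the function name `isClusterStale` and its parameters are assumptions made for this example; only `cluster_update_time_unix`, `Date.now()` and the 15-minute and 5-minute values come from this thread.

```javascript
// Hypothetical helper, not taken from the calamari-clients source.
// Both timestamps are milliseconds since the epoch, as in the GUI.
const MINUTE_MS = 60 * 1000;

function isClusterStale(clusterUpdateTimeUnix, failThresholdMs, now = Date.now()) {
    // Stale when the last cluster update is older than the fail threshold.
    return now - clusterUpdateTimeUnix > failThresholdMs;
}

// Fixed "now" so the example is deterministic.
const now = Date.UTC(2014, 11, 9, 12, 0, 0);
const lastUpdate = now - 6 * MINUTE_MS; // last update seen 6 minutes ago

console.log(isClusterStale(lastUpdate, 15 * MINUTE_MS, now)); // false
console.log(isClusterStale(lastUpdate, 5 * MINUTE_MS, now));  // true
```

With the current 15-minute threshold, a cluster that went quiet six minutes ago is still shown as healthy; with a 5-minute threshold it would already be flagged.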

The cluster update time is sourced from the health sync_object, which is subject to the favorite window. In some cases that window can be as high as 3-4 minutes, but if things are working as expected it should only be 30-40 seconds.

I would imagine that we could easily change this to 5min. We could try 1-2min and see what happens. What do you think?

@tserong
Contributor Author

tserong commented Jan 20, 2015

@GregMeno apologies for the laggy reply.

If the favorite window gets as high as 3-4 minutes, is that indicative of some sort of problem? Because if so, I'd lean towards setting the fail threshold to 1-2min anyway, so that case becomes obvious (or obvious-ish). But if 3-4 minutes is expected sometimes, and isn't scary, then I guess 5min would be more sensible for the fail threshold.
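
The trade-off can be made concrete with a small sketch. Everything here is illustrative: `wouldFalseAlarm` and the constant names are hypothetical, and the lag figures are simply the upper ends of the ranges quoted earlier in this thread (40 seconds when healthy, 4 minutes in the worst case).

```javascript
// Hypothetical check: does a given fail threshold trigger the "stale"
// warning even though the cluster is alive but its updates are lagging?
const SECOND = 1000;
const MINUTE = 60 * SECOND;

function wouldFalseAlarm(failThresholdMs, favoriteWindowMs) {
    // If updates can legitimately lag by at least the threshold,
    // the warning can appear while the cluster is still healthy.
    return favoriteWindowMs >= failThresholdMs;
}

const candidates = [1 * MINUTE, 2 * MINUTE, 5 * MINUTE];
const healthyLag = 40 * SECOND;  // upper end of the expected 30-40 seconds
const worstCaseLag = 4 * MINUTE; // upper end of the 3-4 minute case

const healthyResults = candidates.map((t) => wouldFalseAlarm(t, healthyLag));
const worstCaseResults = candidates.map((t) => wouldFalseAlarm(t, worstCaseLag));

console.log(healthyResults);   // [ false, false, false ]
console.log(worstCaseResults); // [ true, true, false ]
```

Under these assumptions, a 1-2 minute threshold only raises the warning spuriously when the favorite window stretches to several minutes, which is exactly the case in question; a 5-minute threshold stays quiet in both scenarios.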

@tserong
Contributor Author

tserong commented Feb 6, 2015

Ping?

tserong added a commit to SUSE/calamari-clients that referenced this issue Apr 9, 2015
This means if the cluster loses quorum, it will take only 5 minutes for
the "cluster updates are stale" warning to appear, rather than 15
minutes (see ceph#75 for
discussion).

Signed-off-by: Tim Serong <[email protected]>
tserong added a commit to SUSE/calamari-clients that referenced this issue Apr 10, 2015
This means if the cluster loses quorum, it will take only 5 minutes for
the "cluster updates are stale" warning to appear, rather than 15
minutes (see ceph#75 for
discussion).

Signed-off-by: Tim Serong <[email protected]>
tserong added a commit to SUSE/romana that referenced this issue Jun 25, 2015
This means if the cluster loses quorum, it will take only 5 minutes for
the "cluster updates are stale" warning to appear, rather than 15
minutes (see ceph/calamari-clients#75 for
discussion).

Signed-off-by: Tim Serong <[email protected]>
(cherry picked from commit 40dfe5b87d795dc620e226c6ad9272839073d11d)