Reduce krakenFailThreshold? #75

tserong · 2014-12-09T12:58:18Z

Currently, if there haven't been any heartbeats for 15 minutes, the UI shows "Cluster Updates Are Stale. The Cluster isn't updating Calamari. Please contact Administrator". Can we lower this threshold to, say, five minutes, or maybe even two minutes (given the heartbeat interval is one minute), to give the user quicker notice of failure?

The reason I'm asking is, if the entire cluster is dead (or, at least, if there aren't enough mons alive for the cluster to be quorate, thus providing no heartbeat information), ISTM it would be better if this failure were apparent sooner rather than later.

ChristinaMeno · 2015-01-09T15:51:00Z

@tserong I think we can probably tighten this up. cthulhu schedules heartbeats to run every 10s, so the lower bound for a new heartbeat is 10s + processing time.
In the GUI, this window depends on the difference between cluster_update_time_unix and Date.now(), both expressed as ms since epoch.
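For illustration, the comparison described above could look roughly like the sketch below. The names `FAIL_THRESHOLD_MS` and `isClusterStale` are hypothetical and are not taken from the calamari-clients source; only `cluster_update_time_unix`, `Date.now()`, and the ms-since-epoch units come from this thread.

```javascript
// Hypothetical sketch, not the actual calamari-clients implementation.
// Candidate threshold discussed in this issue: 5 minutes, in milliseconds.
const FAIL_THRESHOLD_MS = 5 * 60 * 1000;

// clusterUpdateTimeUnix and now are both milliseconds since the Unix epoch.
// Returns true when the last cluster update is older than the threshold.
function isClusterStale(clusterUpdateTimeUnix, now = Date.now(), thresholdMs = FAIL_THRESHOLD_MS) {
    return now - clusterUpdateTimeUnix > thresholdMs;
}

const now = Date.now();
console.log(isClusterStale(now - 30 * 1000, now));     // update 30 s ago: false
console.log(isClusterStale(now - 6 * 60 * 1000, now)); // update 6 min ago: true
```

Passing `now` explicitly keeps the check deterministic, which makes it easier to try out different thresholds.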

The cluster update time is sourced from the health sync_object, which is subject to the favorite window. In some cases that window can be as high as 3-4 minutes. If things are working as expected, even the favorite window should only be 30-40 seconds.

I would imagine that we could easily change this to 5min. We could try 1-2min and see what happens. What do you think?

tserong · 2015-01-20T07:53:32Z

@GregMeno apologies for the laggy reply.

If the favorite window gets as high as 3-4 minutes, is that indicative of some sort of problem? Because if so, I'd lean towards setting the fail threshold to 1-2min anyway, so that case becomes obvious (or obvious-ish). But if 3-4 minutes is expected sometimes, and isn't scary, then I guess 5min would be more sensible for the fail threshold.
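To make that trade-off concrete, here is a rough sketch comparing candidate thresholds against the window figures quoted above. The function name `falseAlarmRisk` and the constants are hypothetical; the 4 minute and 40 second values are just the upper ends of the ranges mentioned in this thread.

```javascript
// Hypothetical sketch using the figures quoted in this thread, in seconds.
const WORST_CASE_WINDOW_S = 4 * 60; // favorite window "as high as 3-4 minutes"
const NORMAL_WINDOW_S = 40;         // favorite window "30-40 seconds" when healthy

// Reports whether a given fail threshold would trip while an update is still
// within the normal or worst-case favorite window.
function falseAlarmRisk(thresholdS) {
    return {
        thresholdS,
        tripsOnNormalWindow: thresholdS <= NORMAL_WINDOW_S,
        tripsOnWorstCaseWindow: thresholdS <= WORST_CASE_WINDOW_S,
    };
}

for (const minutes of [1, 2, 5, 15]) {
    console.log(falseAlarmRisk(minutes * 60));
}
```

Under these assumptions, a 1-2 minute threshold would flag the 3-4 minute case, while 5 minutes would not.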

tserong · 2015-02-06T05:41:18Z

Ping?

This means if the cluster loses quorum, it will take only 5 minutes for the "cluster updates are stale" warning to appear, rather than 15 minutes (see ceph#75 for discussion). Signed-off-by: Tim Serong <[email protected]>

This means if the cluster loses quorum, it will take only 5 minutes for the "cluster updates are stale" warning to appear, rather than 15 minutes (see ceph/calamari-clients#75 for discussion). Signed-off-by: Tim Serong <[email protected]> (cherry picked from commit 40dfe5b87d795dc620e226c6ad9272839073d11d)

tserong mentioned this issue Apr 10, 2015

Reduce krakenFailThreshold to 5 minutes (bnc#903007) #93

Open
