Reduce krakenFailThreshold? #75
Comments
@tserong I think we can probably tighten this up. cthulhu schedules heartbeats to run every 10s, so the lower bound for a new heartbeat is 10s plus processing time. The cluster update time is sourced from the health sync_object, which is subject to the favorite window, which in some cases can be as high as 3-4 minutes. If things are working as expected, the favorite window should only be 30-40 seconds. I would imagine that we could easily change this to 5min. We could try 1-2min and see what happens. What do you think?
@GregMeno apologies for the laggy reply. If the favorite window gets as high as 3-4 minutes, is that indicative of some sort of problem? Because if so, I'd lean towards setting the fail threshold to 1-2min anyway, so that case becomes obvious (or obvious-ish). But if 3-4 minutes is expected sometimes, and isn't scary, then I guess 5min would be more sensible for the fail threshold.
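To make the trade-off concrete, here is a rough sketch of the arithmetic being discussed. It assumes a simplified model in which the age of the last cluster update on a healthy system is at most the heartbeat interval plus the favorite window, ignoring processing time. The function names are hypothetical and are not taken from the Calamari code; the numbers are the ones quoted in this thread.

```python
HEARTBEAT_INTERVAL_S = 10  # from the thread: cthulhu schedules heartbeats every 10s


def worst_case_update_age(favorite_window_s, heartbeat_interval_s=HEARTBEAT_INTERVAL_S):
    """Upper bound on update age for a healthy cluster (simplified model:
    heartbeat interval plus favorite window, processing time ignored)."""
    return heartbeat_interval_s + favorite_window_s


def would_false_alarm(threshold_s, favorite_window_s):
    """True if a healthy cluster could still trip the stale warning."""
    return worst_case_update_age(favorite_window_s) >= threshold_s


if __name__ == "__main__":
    # Candidate thresholds: 1min, 2min, 5min.
    for threshold_s in (60, 120, 300):
        normal = would_false_alarm(threshold_s, favorite_window_s=40)    # expected window
        degraded = would_false_alarm(threshold_s, favorite_window_s=240)  # 4min window
        print(threshold_s, normal, degraded)
```

Under this model, a 1-2min threshold never fires while the favorite window stays at 30-40 seconds but would fire when the window stretches to 3-4 minutes, whereas a 5min threshold tolerates both cases.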
Ping?
This means if the cluster loses quorum, it will take only 5 minutes for the "cluster updates are stale" warning to appear, rather than 15 minutes (see ceph#75 for discussion). Signed-off-by: Tim Serong <[email protected]>
This means if the cluster loses quorum, it will take only 5 minutes for the "cluster updates are stale" warning to appear, rather than 15 minutes (see ceph/calamari-clients#75 for discussion). Signed-off-by: Tim Serong <[email protected]> (cherry picked from commit 40dfe5b87d795dc620e226c6ad9272839073d11d)
Currently, if there haven't been any heartbeats for 15 minutes, the UI shows "Cluster Updates Are Stale. The Cluster isn't updating Calamari. Please contact Administrator". Can we lower this threshold to, say, five minutes, or maybe even two minutes (given the heartbeat interval is one minute), to give the user quicker notice of failure?
The reason I'm asking is, if the entire cluster is dead (or, at least, if there aren't enough mons alive for the cluster to be quorate, thus providing no heartbeat information), ISTM it would be better if this failure were apparent sooner rather than later.
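For illustration, the check being discussed amounts to comparing the age of the last cluster update against a fail threshold. The sketch below is not the actual Calamari client code; `updates_are_stale` and its parameters are hypothetical names, and only the 15 minute and 5 minute values come from this issue.

```python
from datetime import datetime, timedelta, timezone

CURRENT_THRESHOLD = timedelta(minutes=15)   # value described in this issue
PROPOSED_THRESHOLD = timedelta(minutes=5)   # value proposed in this issue


def updates_are_stale(last_update, now, fail_threshold):
    """Return True when the last cluster update is older than the threshold.

    Hypothetical helper: the real UI logic may differ.
    """
    return now - last_update > fail_threshold


if __name__ == "__main__":
    now = datetime(2014, 1, 1, 12, 0, tzinfo=timezone.utc)  # arbitrary example time
    last_update = now - timedelta(minutes=6)                # quorum lost 6 minutes ago
    print(updates_are_stale(last_update, now, CURRENT_THRESHOLD))
    print(updates_are_stale(last_update, now, PROPOSED_THRESHOLD))
```

With a cluster that stopped reporting six minutes ago, the 15 minute threshold still shows the cluster as current, while the 5 minute threshold already flags it as stale.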