Fault detection ping doesn't check for disk health #40326
Comments
Pinging @elastic/es-distributed |
Some initial thoughts: Can local disks become unresponsive as you describe, or is this uniquely a failure mode of network-attached storage? We generally recommend against network-attached storage. How exactly do you propose deciding that the disk has become unresponsive? I don't think removing such a broken node from the cluster is the right way to react here. If it were still running then it would keep trying to rejoin the cluster, and I think this would be rather disruptive. I suspect the right reaction is to shut it down. I don't think that this needs any kind of distributed check. The node ought to be able to make this decision locally and react accordingly. Somewhat related to #18417. |
@DaveCTurner This has happened on i3.16xlarge (SSD-based instance storage) |
An initial proposal We can expose an API like |
Interesting.
This is the crux of the matter. How precisely do you propose to make this decision? |
For writes it could be if the |
Can you share the stats you collected from Also do you have an idea for implementing this portably? Your suggestion of |
We discussed this idea as a team and raised two questions:
|
No further feedback received. @Bukhtawar if you have the requested information please add it in a comment and we can look at re-opening this issue. |
Can't we leverage the lag detector (or something along similar lines) and send out periodic no-op cluster state updates if there hasn't been an update recently (say, every minute or every five minutes), so as not to overload the cluster? If the node with the bad disk fails to apply the cluster state, we can kick it out. |
It's impossible for us to say whether this would help without the further information requested above. "Doesn't trigger the lag detector" is a very weak health check. |
Some context: I noticed a problem (read-only) with the volume which continued for over 2 hours. A cluster state update was not published until the first 40 minutes had elapsed, and all this while requests were stalled on the problematic node. Only after the volume recovered did the node apply the cluster state update. The idea here is that if there are no updates, the master wouldn't be able to detect a disk which had turned read-only, and requests could be stalled. There are other approaches, like reading and writing a file and checking the latency, but they suffer from false positives, which could be due to long GC pauses. Another way could be for the master to initiate periodic writes on nodes; if the writes haven't been processed within a threshold (60s), the master kicks the node out. A node joining back should validate that writes go through. |
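To make the false-positive concern concrete, here is a minimal sketch of a threshold policy that tolerates an occasional slow probe (for example one inflated by a long GC pause) and only reports a problem after several consecutive bad results. The class name `StallDetector`, the `required_strikes` parameter and its default of 3 are assumptions made for illustration; only the 60s threshold comes from the comment above, and none of this is Elasticsearch code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StallDetector:
    """Hypothetical policy: flag a disk only after repeated bad probes."""

    threshold_s: float = 60.0   # threshold mentioned in the thread
    required_strikes: int = 3   # assumed value, not from the thread
    strikes: int = 0

    def record(self, latency_s: Optional[float]) -> bool:
        """Record one probe result and return True if the disk looks unhealthy.

        latency_s is the probe duration in seconds, or None if the probe
        failed or never completed.
        """
        if latency_s is not None and latency_s <= self.threshold_s:
            # A single good probe clears earlier strikes.
            self.strikes = 0
        else:
            self.strikes += 1
        return self.strikes >= self.required_strikes
```

With these assumed defaults, one slow probe between two fast ones never trips the detector, while three slow or failed probes in a row do. The trade-off is slower detection: the worst case is roughly `required_strikes` probe intervals.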
You certainly can't start a node on a readonly filesystem, so maybe a node should shut down if its filesystem becomes readonly while it's running. This very question is already on our agenda to discuss at some point in the future. I don't think we need to do anything as complicated as a cluster state update to check for a readonly filesystem, as I noted above:
The same is true of checking for IO latency: why bother the master with this at all? |
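As an illustration of what such a local check might look like, the sketch below writes, fsyncs and deletes a small file in the data directory, running the probe in a daemon thread so that a hung disk shows up as a timeout rather than blocking the caller. The function names, the status strings and the 60-second default are assumptions for this example; it is not how Elasticsearch implements anything.

```python
import errno
import os
import tempfile
import threading
import time


def write_probe(data_dir: str) -> float:
    """Write, fsync and delete a small temp file; return elapsed seconds."""
    start = time.monotonic()
    fd, path = tempfile.mkstemp(prefix=".disk-probe-", dir=data_dir)
    try:
        os.write(fd, b"probe")
        os.fsync(fd)  # force the write down to the device
    finally:
        os.close(fd)
        os.unlink(path)
    return time.monotonic() - start


def check_disk(data_dir: str, timeout_s: float = 60.0) -> str:
    """Return 'healthy', 'stalled', 'read_only' or 'failed' (hypothetical labels)."""
    result = {}

    def run() -> None:
        try:
            result["latency"] = write_probe(data_dir)
        except OSError as exc:
            result["error"] = exc

    # Daemon thread: a probe stuck in the kernel must not block shutdown.
    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        return "stalled"  # the write never came back within the timeout
    if "error" in result:
        is_read_only = result["error"].errno == errno.EROFS
        return "read_only" if is_read_only else "failed"
    return "healthy"
```

A node could call `check_disk` periodically on each data path and decide locally how to react, for example by shutting down. The thread is needed because a truly unresponsive device may never return an error at all.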
... and I'll ask again for answers to the questions we posed earlier. |
If most of the nodes were facing an outage (region-wide) then a cluster-level decision becomes important. It would be more desirable to kick the node out only if doing so doesn't turn the cluster health RED or degrade some other health characteristic. I did try looking into how other systems behave, but it looks like these systems face a lot of false positives and need operator intervention. |
The best part about the lag detector is that it's more deterministic, leaving less room for false positives. I am not saying that cluster state is the solution here, but anything along these lines should definitely be helpful, at least for cases where the issues are very obvious and yet no remediation happens. |
@DaveCTurner just curious: how would the lag detector respond to a joining node with a read-only filesystem? After the node is kicked out of the cluster due to lagging state, it would retry the join. The join validation on the master and the full join validation on the node will not detect a read-only disk (maybe they won't write anything to disk), so the join validations succeed and the master updates the cluster state with the node join. But then the joining node would fail to apply this state (its own join), causing this to go in a loop. |
Yes, that's the issue, and the very argument for performing these checks locally. |
But isn't that an issue today with the lag detector? Shouldn't this need a fix to avoid too many flip-flops once a disk goes read-only? Would it be better if join requests could persist the cluster state passed by the master as part of join validation before acking back? @DaveCTurner I hear you, but the only point I am trying to make is that taking a cluster-wide decision (through the master, maybe) could help protect overall cluster health from going RED. Also, would node start-up operations always involve disk reads/writes? I see there is a plan for better consistency checks as part of #44624. Is that the reason your recommendation of local checks and shutdown won't need additional start-up checks (flip-flops won't happen if there is a clear disk failure, e.g. read-only)? |
Yes indeed, that's why it's on our agenda to discuss.
I don't follow. If the cluster cannot accept writes to some shards then RED is surely the correct health to report?
Yes, you cannot start up a node on a readonly filesystem. |
Would shard relocation not work if the disk is read-only? I guess it would be better to relocate shards off the node on a best-effort basis before shutting it down, to avoid running into a RED state. |
A readonly filesystem on a data node is an unsupported situation, although I will admit that Elasticsearch's behaviour if the filesystem goes readonly could be better-defined than it is today. We don't really expect anything to work in such a case. Elasticsearch expects to be able to write to disk on the source node (the primary) during a peer recovery, and may fail the shard if it discovers it cannot do so. This conversation started out discussing local disks becoming readonly, but now you seem to be concerned with outages affecting multiple nodes in a region-wide fashion. Can you explain more clearly how you can have a whole region's worth of local disks go readonly at the same time? Can you also answer the outstanding question about why your IO subsystem was hanging rather than timing out and returning an error when the local disk became unresponsive? |
Thanks @DaveCTurner for the detailed explanation
Apologies, I wasn't clear. What I actually meant was that multiple nodes (maybe an AZ, not the entire region) can face outages at the same time. In cases where the cluster isn't zone-aware/zone-balanced, it would be more desirable to remediate one node at a time, so as not to cause a RED cluster for read-intensive workloads (I am assuming reads should go through).
I noticed it was something like |
@DaveCTurner I was thinking that having a similar check on node joins should help #16745. Thoughts? |
Problem
The fault detection pings are lightweight and mostly check for network connectivity across nodes, so that unreachable nodes can be kicked out of the cluster. In some cases one of the disks can become unresponsive, in which case some APIs like /_nodes/stats might get stuck. Ongoing writes can be stalled until this state is detected by some other means and the disk is replaced. I believe the lag detector in 7.x would be the first to figure this out and remove the node from the cluster state, but only if there has been a cluster state update. Could this be detected upfront with deeper health checks?