[Feature Request] Make Elasticsearch crash after `Caused by: java.io.IOException: No space left on device`, rather than spamming logs
#24299
Comments
I don't think that crashing the node is the best approach when there is no space left on device. The node can still serve read requests, and if indices are removed/relocated (or, in this case, old log files are removed) then there should be sufficient space to handle write requests again. In the case of log files filling up, maybe ES should try to prevent this by capping the size of the log file, as is done for deprecation logging (with log4j's SizeBasedTriggeringPolicy)?
Also, a disk might fill up because of a big merge. As soon as that merge fails, disk space could be freed up again, allowing the node to continue working.
In that case, it might be best to put a hold on the operations that are unable to complete. A queue that retries at a much lower rate until operations succeed would address the concern of staying available for reads while waiting for disk to be freed, without causing undue spam. For an exception that requires manual intervention, a retry rate of once a minute would be acceptable, in my opinion.
I agree killing the node might not be the right thing here from a 10k-feet view, while I can see it as a viable solution for some use cases. Out of the box I'd like us to rather detect that we are close to a full disk and then simply stop write operations like merging and indexing. Yet this still has issues: if we have a replica on the node we'd reject a write, which would bring the replica out of sync at that moment. Disk-threshold allocation deciders should help here by moving stuff away from the node, but essentially killing it and leaving the cluster would be the right thing to do here IMO. If we wanna do that we need to somehow add corresponding handlers in many places, or we start by adding it into the engine and allow folks to opt out of it? I generally think we should kill nodes more often in disaster situations instead of just sitting there and waiting for more disaster to happen.
Yeah, so you cannot write to the disk if it's full, but spamming the logs isn't the way to go. Could do a "read only" style mode where it just says it went into read-only because of no disk space. There's no need to spam the logs, even with a cap.
This is not necessarily true: the logs could (and should!) be on a separate mount, they could have log rotation applied to them, etc.
I agree, but I'm unsure if disk-full qualifies as such a disaster situation since it's possible to recover from.
In a concurrent server application there are likely many disk operations in flight; expecting a single message is not realistic. As others have mentioned, this situation is not completely fatal and can be recovered from, so crashing on disk-full should not be a first option.
One option is to provide a few behaviors for the operator to pick from: fatal crash, start refusing writes, or something else. Another option to consider is prevention: it should be reasonable to check the remaining space daily and, if it drops below, say, 5%, log a severe warning.
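A minimal sketch of that preventative check, assuming an external Python watchdog run alongside the node (the data path and the 5% threshold are illustrative; this is not an Elasticsearch feature):

```python
import logging
import shutil

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("disk-watchdog")

DATA_PATH = "/var/lib/elasticsearch"  # hypothetical data directory
MIN_FREE_FRACTION = 0.05              # warn when less than 5% is free

def check_free_space(path: str = DATA_PATH) -> None:
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total
    if free_fraction < MIN_FREE_FRACTION:
        log.critical(
            "Only %.1f%% disk space left on %s; free space or relocate shards now",
            free_fraction * 100,
            path,
        )

if __name__ == "__main__":
    check_free_space()
```

Running it once a day (e.g. from cron) would match the suggestion above.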
The question is what the recovery path is? Most likely we need to relocate shards, but shouldn't the disk-threshold decider have taken care of this already? Once we are at, like, 99% there is not much we can do but fail? It's likely the most healthy option here: it tells the cluster to heal itself by allocating shards on other nodes, and it notifies the users since a node died. The log message might be clear, and we can refuse to start up until we have at least 5% disk back? I kind of like this option the more I think about it.
@s1monw You're starting to convince me. Here's another thought that has occurred to me: if we keep the node alive, I don't think there's a lot that we should or can do (without a lot of jumping through hoops) about the disk-full log messages, so those are going to keep pumping out. If those log messages are being sent to a remote monitoring cluster, the disks on the nodes of the remote monitoring cluster could be overwhelmed too, and now you have two problems (a remote denial of service on the monitoring cluster). This is an argument for dying.
Another thing to consider with respect to dying is that operators are going to have their nodes set to auto-restart (we encourage this because of dying with dignity). If we fail startup when the disk is full, as we should if we proceed with dying when the disk is full, we will end up in an infinite retry loop that will also be spamming logs, and we haven't solved anything. I discussed this concern with @s1monw and we discussed the idea of a marker file to track the number of retries and simply not starting at all if that marker file is present and the count exceeds some amount. At this point manual intervention is required, but it already is for a full disk anyway.
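A rough illustration of the marker-file idea, using a hypothetical file name and a made-up retry limit; this is not actual Elasticsearch startup code, just the shape of the guard being discussed:

```python
import sys
from pathlib import Path

MARKER = Path("/var/lib/elasticsearch/disk_full_retries")  # hypothetical marker file
MAX_RETRIES = 3                                            # made-up cutoff

def may_start() -> bool:
    # Count previous failed startups recorded in the marker file.
    retries = int(MARKER.read_text()) if MARKER.exists() else 0
    if retries >= MAX_RETRIES:
        return False  # stop auto-restarting; require manual intervention
    MARKER.write_text(str(retries + 1))
    return True

def on_successful_start() -> None:
    # Remove the marker once the node comes up healthy again.
    MARKER.unlink(missing_ok=True)

if __name__ == "__main__":
    if not may_start():
        sys.exit("disk-full retry limit reached; manual intervention required")
```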
Lots of things. I'm sure I'll think of more things. I don't really like the idea of shooting the node if it runs out of disk space. I just have a gut feeling about it more than all the stuff I wrote.
This one is interesting. If you have only one copy it will be unavailable until the node has enough disk to recover. But you won't lose data, it's still there. If you have a copy the cluster will try to allocate it elsewhere and we slowly heal the cluster, or the node comes back quickly with more disk space. I think dropping out is a good option IMO?
The same goes for OOM or any other fatal/non-recoverable error. The question is whether we treat disk full as non-recoverable. IMO yes, we won't recover from it; the cluster will be in trouble anyhow.
This is not true - we keep the data on disk until we have allocated the replica on another node. This can take very long.
This is a different problem which I agree we should tackle, but it's unrelated to out-of-disk IMO.
We don't have this option unless we switch all indices allocated on it read-only? That is pretty drastic and very error prone.
Right. I take this point back. Personally I don't think nodes should kill themselves, but I'm aware that is asking too much. There are unrecoverable things like OOMs and other bugs. We work hard to prevent these, and they break things in subtle ways when they hit. If running out of disk is truly as unrecoverable as a Java OOM then we should kill the node, but we need to make intensive efforts to ensure it doesn't happen, like we did with the circuit breakers. The disk allocation stuff doesn't look like it is enough.
Agreed, we should think and try harder to make it less likely to get to this point. One thought I was playing with was to tell the primary that the replica does not have enough space left, and if so we can reject subsequent writes. Such information can be transported back to the primary with the replica write responses. Once the primary is in such a state it will stay there until the replica tells its primary it's in good shape again, so we can continue indexing. I really think we should push back to the user if stuff like this happens, and we need to give the nodes some time to move shards away, which is sometimes not possible due to allocation deciders or no space on other nodes. In such a case we can only reject writes.
+1
@s1monw, if there is a gap between when we start moving shards off of a node and when we reject writes, then this can be a backstop. I guess we'd also want the primary to have this behavior too, right?
@nik9000 yes, we should accept all writes on replicas all the time; we just need to prevent the primary from sending them. So yes, the primary should also have such a flag.
Ah! Now I get it. For those of you who, like me, don't get it at first: the replication model in Elasticsearch dictates that if a primary has accepted a write but the replica rejects it, then the replica must fail. Once failed, the replica has to recover from the primary, which is a fairly heavy operation. So replicas must do their best not to fail. In this context that means the replica should absorb the write even though it is running out of space, but it should tell the primary to reject further writes. We can tuck the "help, I'm running out of space" flag into the write response. So we have to set the disk space threshold quite a bit before the disk actually gets full, because this is super-asynchronous.
@nik9000 yes, that is what I meant... thanks for explaining it in other words again.
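A purely illustrative sketch of that backpressure idea, with hypothetical names (this is not Elasticsearch's actual replication code): the replica absorbs the write but piggybacks a low-disk flag on its response, and the primary rejects new writes while the flag is set.

```python
from dataclasses import dataclass

@dataclass
class ReplicaWriteResponse:
    acked: bool
    low_disk: bool  # hypothetical flag piggybacked on the write response

class Primary:
    def __init__(self) -> None:
        self.writes_blocked = False

    def handle_replica_response(self, resp: ReplicaWriteResponse) -> None:
        # The replica still absorbed the write (failing it would force a costly
        # recovery from the primary); it only signals that it is low on disk.
        self.writes_blocked = resp.low_disk

    def index(self, doc: dict) -> None:
        if self.writes_blocked:
            raise RuntimeError("rejecting write: a replica reported low disk space")
        # ... otherwise replicate the document to the replicas as usual ...
```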
A concern I have: consider a homogeneous cluster with well-sharded data (these are not unreasonable assumptions). If one node is running low on disk space, then they are all running low on disk space. Killing the first node to run out of disk space will lead to recoveries on the other nodes in the cluster, exacerbating their low-disk issues. Shooting a node can lead to a cluster-wide outage.
We spoke about this yesterday in a meeting, but I want to add my response here anyway for completeness. I think in such a situation the watermarks will protect us, since if a node is already high on disk usage we will not allocate shards on it. We also have a good notion of how big shards are for relocation, so we can make good decisions here. That is not absolutely watertight, but I think we are ok along those lines. We also spoke about a possible solution to the problem of continuing indexing when a node is under disk-space pressure. The plan to tackle this issue is to introduce a new kind of index-level cluster block that will be set automatically on all indices that have at least one shard on a node that is above the flood_stage watermark (which is another setting, set to `95%` disk utilization by default).
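The disk watermark settings referenced here (`cluster.routing.allocation.disk.watermark.low`, `.high`, and the `flood_stage` setting that came out of this issue) are dynamic cluster settings. A sketch of adjusting them over the REST API with Python's `requests`; the host, port, and example percentages are assumptions:

```python
import requests

ES = "http://localhost:9200"  # assumed local node address

settings = {
    "transient": {
        "cluster.routing.allocation.disk.watermark.low": "85%",
        "cluster.routing.allocation.disk.watermark.high": "90%",
        "cluster.routing.allocation.disk.watermark.flood_stage": "95%",
    }
}

# Apply the watermarks cluster-wide; "transient" settings reset on a full cluster restart.
resp = requests.put(f"{ES}/_cluster/settings", json=settings)
resp.raise_for_status()
print(resp.json())
```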
Just giving @bleskes a ping here since he might be interested in this as well...
Thx @s1monw. Indeed interesting.
I am convinced we should reject all writes and not throttle once we cross a certain line. Folks can raise that bar if they feel confident, but let's not continue writing.
We are trying to not even get to this point. We try to prevent adding more data once we have crossed the flood_stage watermark.
My summary of our chats steps away from failing the node... we try to not even get to that point, and we make indices that are allocated on a node that crosses the flood_stage watermark read-only instead.
I thought about this more and I agree. It's a simpler solution than slowing things down.
Ok. Good. Then there is no need to offer alternatives :)
By adding an index-level block which excludes writes, we block write operations in the reroute phase, even before they go into the replication phase. Everything that's beyond this point will be processed correctly on both replicas and primaries. I think we're good here.
Today, when we run out of disk, all kinds of crazy things can happen and nodes become hard to maintain once out-of-disk is hit. While we try to move shards away when we hit watermarks, this might not be possible in many situations. Based on the discussion in #24299, this change monitors disk utilization and adds a flood-stage watermark that causes all indices allocated on a node hitting the flood-stage mark to be switched read-only (with the option to be deleted). This allows users to react to the low-disk situation while subsequent write requests are rejected. Users can switch individual indices back to read-write once the situation is sorted out. There is no automatic read-write switch once the node has enough space again; this requires user interaction. The flood-stage watermark is set to `95%` utilization by default. Closes #24299
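Once disk space has been freed, the change above expects the user to lift the block manually. The block applied at flood stage is `index.blocks.read_only_allow_delete`; clearing it looks roughly like this (again with `requests`, and an assumed host and index name):

```python
import requests

ES = "http://localhost:9200"  # assumed local node address
INDEX = "my-index"            # hypothetical index that was switched read-only

# Setting the block to null removes it and makes the index writable again.
resp = requests.put(
    f"{ES}/{INDEX}/_settings",
    json={"index.blocks.read_only_allow_delete": None},
)
resp.raise_for_status()
```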
Woo!
Can we not re-evaluate the blocks when disk frees up, rather than letting the end user worry about blocks?
Hi @Bukhtawar, this is worth discussing. Will you file a new issue for it?
@s1monw I am using v7.0.4 but still getting this error. Can you tell me which version to use so that ES doesn't become unresponsive when the disk is full?
The issue mentions that when the flood-stage watermark is breached, we will make indices read-only while allowing deletes and disabling merges. Wondering why merges were not disabled on the read-only index? Is that intentional?
Describe the feature: As per #20354, logs are spammed with
Caused by: java.io.IOException: No space left on device
when the disk space runs out. Why not make it leave a single message in the logs, then just crash? As the disk is full, the database cannot run anyway, so rather than spamming the logs, just adopt a simpler approach. This is a production database, and I wouldn't expect my logs to look like that after waking up.