storage: out of disk space #8473
testServerArgs will now take StoreSpecs instead of just a number of stores. This also adds StoreSpecsPerNode on testClusterArgs to enable per node store settings. Currently, only in-memory stores are supported. Part of work towards cockroachdb#8473.
testServerArgs will now take StoreSpecs instead of just a number of stores. This also adds ServerArgsPerNode on testClusterArgs to enable per server customizable settings. Currently, only in-memory stores are supported. Part of work towards cockroachdb#8473.
There are 3 ways in which we will mitigate out of disk space errors.
I will create issues for each of these and close out this one.
I'm surprised performing range-level flow control isn't on that list. Rebalancing alone cannot prevent running into an out-of-disk-space situation. P.S. We already have alerting on our test clusters for low disk space. I'm not sure there is anything additional to do for 1.
For the metrics, I'm going to look into the current ones and make sure we also surface them in our admin UI, or improve them if needed. Range-level flow control is required, but I was looking for tractable changes we can make right now. I'll add an issue for that as well. Also, after chatting with @bdarnell about this, I think he's correct in saying we should not be doing any type of replica/range freezes or other fancy solutions. Best to fail naturally to avoid any corruption.
Probably worthwhile to consider out of disk space scenarios:
Is there any evidence that we corrupt data when we run out of disk space? I thought the problem was that a node crashed when that happened. RocksDB certainly shouldn't corrupt data if it can't write.
There's only so much we can do here. By panicking, we give the admin time to clean up the disk or move the data to a larger drive before starting the node back up.
Agreed. We already have safeguards in place for this, but it would be nice to have some of the other options, including high/low watermark levels for rebalancing; as of right now, though, I don't think they would really help. Moving to rebalancing based on size instead of range count would go a long way toward fixing our issues.
I see reading data from a cluster that is over capacity as a non-issue at this point. We might be able to add a mechanism to do so from the command line or some other solution, but for now, my advice would be to copy the data to machines with more disk space and restart the cluster. This could easily be a version 2 or 3 feature, but shouldn't be a goal for 1.0.
I went looking but I can't find clear evidence. I did find some projects that depend on RocksDB which found, after disk-full errors, that their checksums were not correct (after recovery). RocksDB tends to just suggest freeing up space (facebook/rocksdb#919). But why even let it get that far? If we drain the node when close to full, we ensure we do our best not to cross that line. Obviously, there's no guarantee that we could stop RocksDB from panicking, as we only have so much control over disk space.
Some other thoughts on this issue. It would be really nice to know at the range level when a disk is nearly full, so we could prevent KV writes and just return a disk-full error (at the SQL level this could be the corresponding Postgres error code). I'd like to expand on the concept of rebalancing based on % free space left instead of range count. For clarity, % free is calculated as the minimum of either what's left of the defined space quota on the store or the total free space on the node.
We can add some other thresholds to this as well. For example: as long as all stores are less than 50% used, just rebalance based on total bytes, and only once we start using more than 50% of a store switch to % free. This would keep everything even until there is some pressure on the system to become offset. But there are some complications here that will need to be addressed around zones: we need to calculate the mean % free per zone and not just over the whole system. I think that is best left for future work. Also, as an unanswered question, do we still record metrics when we're full?
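To make that proposal concrete, here is a minimal Go sketch of the heuristic, under the assumption that the names (`storeCapacity`, `percentFree`, `useBytesBalancing`) and the thresholds are purely illustrative and not CockroachDB's actual allocator API: % free is the minimum of the remaining store quota and the node's free disk space, and rebalancing stays byte-based until some store drops below 50% free.

```go
package main

import "fmt"

// storeCapacity is an illustrative stand-in for whatever capacity
// information the allocator would gossip about each store.
type storeCapacity struct {
	QuotaBytes     int64 // configured space quota for the store (0 = no quota)
	UsedBytes      int64 // bytes used by the store
	NodeFreeBytes  int64 // free bytes left on the node's disk
	NodeTotalBytes int64 // total bytes on the node's disk
}

// percentFree returns the fraction of usable space remaining: the minimum of
// what's left of the store's quota and the node's free disk space.
func percentFree(c storeCapacity) float64 {
	nodeFree := float64(c.NodeFreeBytes) / float64(c.NodeTotalBytes)
	if c.QuotaBytes == 0 {
		return nodeFree
	}
	quotaFree := float64(c.QuotaBytes-c.UsedBytes) / float64(c.QuotaBytes)
	if quotaFree < nodeFree {
		return quotaFree
	}
	return nodeFree
}

// useBytesBalancing reports whether every store still has at least 50% free
// (per the definition above), in which case rebalancing can remain based on
// total bytes; once any store drops below 50% free, the allocator would
// switch to balancing on % free instead.
func useBytesBalancing(stores []storeCapacity) bool {
	for _, c := range stores {
		if percentFree(c) < 0.5 {
			return false
		}
	}
	return true
}

func main() {
	stores := []storeCapacity{
		{UsedBytes: 40 << 30, NodeFreeBytes: 60 << 30, NodeTotalBytes: 100 << 30},
		{QuotaBytes: 50 << 30, UsedBytes: 30 << 30, NodeFreeBytes: 80 << 30, NodeTotalBytes: 100 << 30},
	}
	fmt.Println("rebalance on total bytes:", useBytesBalancing(stores))
}
```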
I agree with moving back to percentage disk space used in principle. Previously, I believe we were looking at overall disk space used (everything on disk, including the OS) instead of disk space used just by the store. So, if we use the store's live bytes, that should solve the problem we were originally seeing: that with small clusters (live bytes <<< size of OS and other files in bytes), no rebalancing was occurring at all. As always with rebalancing changes, please test with small clusters (these are where first impressions are formed) and larger clusters. We want the rebalancer to work smoothly as the cluster goes from no data to some large amount of data. I have a vague worry about using percentages but nothing concrete yet; I'll continue to think about that. Regarding metrics, I don't think we should do anything special for them for now, especially since pruning requires more Raft commands (meaning more disk space consumed).
Panicking is a big hammer. Having the cluster grind to a halt when it runs out of disk space really isn't acceptable.
Are the current rebalancing heuristics actually a problem? I haven't seen any evidence of that yet. We already rebalance away from any node that is >95% full.
We might not be able to drain the node (the entire cluster may be full). I don't see a point in panicking at some particular percent fullness vs. panicking when we're completely out of space. I do see a large benefit in refusing writes at a certain fullness (i.e. entering a read+delete-only mode). If we panic at 99% full, how do you rebalance ranges off the node? Consider this scenario: you start up a 3-node cluster and write data to it. The nodes fill up, hit your panic threshold, and crash. You add 2 additional nodes to the cluster, but you can't actually restart it because the 3 existing nodes are already at capacity and panic every time they are started. No bueno.
Agreed, but my strong suspicion is that trying to address out-of-disk via rebalancing heuristics is the wrong tool. We undoubtedly need additional improvements to the rebalancing heuristics and will be improving them for the foreseeable future, but out-of-disk space requires a new mechanism to handle real-world scenarios of interest.
It's not optimal, but I just don't think there is much to win here by over-engineering at this point. There are very obvious avenues to iterate on as this becomes a real-world concern. Let's take trivial precautions now, run some acceptance-test clusters with asymmetric disk sizes, and see how efficiently the rebalancer manages to get data off the constrained node. Turning into a read-only cluster when the disk is full seems like such a fringe feature. To compare, this is how much Postgres cares.
In the short term, I also support a simple panic when we have critically low disk space.
To be clear, I'm not arguing we should be doing something complex at this time. I think panicking when we're critically low on disk space is a bit strange, though; IMO, it's better to just run up to the out-of-disk limit. If we want to give the user the flexibility to recover from the out-of-disk situation, then we need to reserve some disk space they can release. A threshold we panic at would need to be configurable so that we can disable it in order to allow a near-out-of-disk node to start up (otherwise the node is permanently wedged). Here is a docs-only solution (we could bake this into the product later): CockroachDB will crash when it runs out of disk space. As an administrator, it may be prudent to reserve a portion of disk space ahead of time so that there is space that can be freed up in an emergency. For example, a command along the lines of the sketch below can reserve 5 GB of disk space in a ballast file.
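The exact command in the original comment was lost, but as a rough illustration of the idea (the path and size are hypothetical, and this is not an official CockroachDB tool), a small Go program could reserve the space by writing a ballast file of zeroes:

```go
package main

import (
	"log"
	"os"
)

func main() {
	// Hypothetical location and size for the ballast; in practice an
	// administrator would pick a path on the same filesystem as the store.
	const ballastPath = "/mnt/data/ballast"
	const ballastSize = 5 << 30 // 5 GiB

	f, err := os.OpenFile(ballastPath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Write zeroes rather than calling Truncate: Truncate would create a
	// sparse file on most filesystems and not actually reserve the space.
	buf := make([]byte, 1<<20) // 1 MiB chunks
	for written := int64(0); written < ballastSize; written += int64(len(buf)) {
		if _, err := f.Write(buf); err != nil {
			log.Fatal(err)
		}
	}
	log.Printf("reserved %d bytes in %s; delete this file in an emergency", int64(ballastSize), ballastPath)
}
```

An administrator would delete this file in an emergency to give the node enough headroom to restart and rebalance.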
But I think it's inevitable, since even read-only operation requires renewing leases, etc. I think that panicking when nearly full is a better outcome than trying to run right up to the limit (in which case rocksdb will panic for us).
It's wedged until you can free up space, which is exactly what would happen if we let the space be completely exhausted. The difference is that having a little bit of slack might help you free up that space (for example, we might have a CLI command to run a RocksDB compaction, which might be able to free up space). Or you have room to compress some files instead of just deleting them, etc. +1 to recommending a ballast file that can be deleted in emergencies.
I don't see much of a difference. Can you expand on why you think an early panic by us is better than waiting for rocksdb to panic?
This could be tricky given that a RocksDB compaction needs disk space in order to free up disk space.
Interesting, though I'd argue that this is exactly an area we should care about more than Postgres does. Consider my earlier example where you start a 3-node cluster and then all the nodes become full. One of the promises of CockroachDB is easy/seamless scalability. It shouldn't take herculean efforts (e.g. copying the node data to bigger disks) to unwedge such a cluster. Rather, our story should be: add a few more nodes and the cluster will automatically be ready to use again. Note that I'm not arguing for something complex here, just a reasonable answer to this scenario. Using a ballast (I like that term) file might be sufficient: if your cluster is out of disk space, add additional capacity via additional nodes, delete the ballast files on the existing nodes, and restart the cluster. If cockroach managed the ballast files itself, this could be done rather seamlessly.
This is exactly why an early panic could be desirable. Panicking isn't ideal (and if rocksdb didn't already panic when out of disk space I probably wouldn't recommend it), but it's easy.
If we believe that RocksDB compaction is a viable way to free up disk space, we need to verify how much free space that requires. I would guess it potentially needs several times the max sstable size (currently 128 MB), but it might need more.
The ballast file is an interesting idea, but could we not achieve the same thing without the need to hoard some disk space? What if we quit (not panic) when we're at less than 1% free space, but make this configurable via a command-line flag at start? Being able to turn off this auto-quit would save the extra space (assuming no other processes are using up disk space) and allow for compaction, or allow the admin to add new storage and relieve the disk-space pressure. I'm going to investigate whether compaction is a way to free up disk space on an already-running cluster.
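As a rough sketch of that idea (the `--min-free-percent` flag, the store path, and the behavior are all made up for illustration, not actual CockroachDB options), a startup check could compare the store's free-space fraction against a configurable threshold and exit cleanly rather than letting RocksDB hit ENOSPC:

```go
//go:build linux || darwin

package main

import (
	"flag"
	"log"
	"syscall"
)

// freeFraction returns the fraction of the filesystem holding path that is
// still available for writes.
func freeFraction(path string) (float64, error) {
	var fs syscall.Statfs_t
	if err := syscall.Statfs(path, &fs); err != nil {
		return 0, err
	}
	return float64(fs.Bavail) / float64(fs.Blocks), nil
}

func main() {
	storeDir := flag.String("store", "/mnt/data", "store directory (example path)")
	minFreePct := flag.Float64("min-free-percent", 1.0,
		"exit if free space drops below this percentage; 0 disables the check")
	flag.Parse()

	free, err := freeFraction(*storeDir)
	if err != nil {
		log.Fatal(err)
	}
	if *minFreePct > 0 && free*100 < *minFreePct {
		// Quit cleanly now, rather than waiting for RocksDB to panic on ENOSPC.
		log.Fatalf("store %s has only %.2f%% free space; shutting down", *storeDir, free*100)
	}
}
```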
Currently, we perform compactions at a rate of around 6 per 5 minutes (on gamma) and 3 per 10 minutes (on rho). I very much doubt that we're going to gain a lot of space by forcing a compaction when we're near the disk-space limit. If we only compacted every hour it might be the case, but not when we are compacting every minute or so.
The compaction metrics are measuring individual compactions involving a few sstables, not full compactions as would be performed from the command line. The individual compactions free up a relatively small amount of space. I don't think your current analysis is valid. What would be better is to take a snapshot from one of the nodes on gamma (i.e. copy the data directory), record the size of the directory (e.g. using …
The ballast guarantees you have disk space you can free. Without reserving disk space in that way, some process (inside Cockroach) may create data that we can't delete. With the ballast, we are guaranteed that we'll panic before using all of the disk space, and we then have a known amount of disk space that can be freed.
It's hard to support ideas like compaction or down-replication. They are non-trivial complications and worse, they either won't work at all or won't work for long. I think we'll put ourselves on a solid footing and can close this bug for now (and let @BramGruneir work on something else) if we:
This will keep us from pushing any node to its critical threshold if there is other space in the cluster. If the entire cluster is full, then we'll panic. But that seems like a good first step.
The logic you propose is already present in the allocator. I'm leaning towards doing either nothing in the near term to address this issue, or adding support for the ballast file.
@dianasaur323 The main concern we have is around how to recover a cockroach node from the scenario where disks are full. What we observed in one of our tests is that a node hangs indefinitely upon an ENOSPC error. That might have been fixed by #19287, but when testing with the fix, we ran into more serious correctness issues (possibly related to #16004 (comment)). We haven't gotten a chance to try our test with the fix for that. In another case, we accidentally ended up filling up a cockroach cluster to 100% on all nodes, and there was no way for us to recover from that situation, for example by truncating a table. We had to wipe the cluster and start over. My expectations around space issues are that:
The ballast file should then be resized to the original size, and all operations can resume.
@garvitjuniwal thanks for the detailed write-up! This is very helpful. Let me circle up with some people internally to see if we can address some of these issues in the next release and propose some solutions. I know there is already ongoing work in some other issues as well, but I don't think they directly address this one.
Ok, I'm back with some updates after speaking with @tschottdorf. With 1, I believe we crash now, so even though it's not necessarily pretty, it shouldn't hang anymore. With 2, this seems to be a good suggestion that could be combined with 4. We could probably think about automating this, but for now, we might have to go with a manual workaround before we have time to provide a better UX here. With 3, we have an open issue about that here: #19329
@dianasaur323, assigning this to you as I haven't been involved with it in a long time.
When a node is 95% full, does it transfer its leases away (and do other nodes stop transferring leases to it)? If so, it seems that we're handling this reasonably well. If a single node runs out of disk, it should try to get rid of both leases and data until it drops below the 95% threshold. The only potentially missing piece is alerting the operator to this fact. Seems like we'd find out if we roachtested this.
I think #25051 gets close to testing this already, except that it fills all nodes' disks and writes data that gets GCed fast enough. We could add a flavor of it that has some nodes with enough space, writes non-GC'able data, and makes sure that the full nodes don't run out of disk space.
That's false as written. Disk fullness doesn't affect lease transfer decisions. It's true if you s/leases/replicas/.
Can't be false, since it's a question! Thanks for answering. The way I was hoping this would work is that the leaseholder would decide to |
:)
Sorry, I didn't realize that's what you were asking. Leaseholders do transfer their lease away if they want to remove themselves from the range: see cockroach/pkg/storage/replicate_queue.go, lines 351 to 377 at 6d0c09a.
Ok. Then, assuming we don't write too fast, that roachtest I'm suggesting above should work, and it's all a matter of making it actually work, correct?
I would like for it to work. I wouldn't bet much on it, given the questionable behavior I saw of rocksdb's compactions filling up the disk on otherwise unused nodes in #22387.
Good point. Really need to figure out how to make
folding into #7782
A node should fail gracefully when out of disk space.