Provide a force GC / RocksDB compaction option for freeing up space #19329
Comments
+cc @garvitjuniwal
Not saying that this is a sufficiently streamlined way of doing so, but couldn't you just change the zone configs to a TTL of 1s and have the data be removed much earlier than the 25h that way?
That's a reasonable approach. Best to use existing pathways. The devil would be in the details of SQL syntax and whatever degree of progress reporting we'd opt for.
We'd also need to force a RocksDB compaction if the situation is really urgent.
Exploring explicit compactions is also on the RESTORE/IMPORT backlog: on success we generate many small files, and on failure we want a way to clean up the orphaned, imported data to free up space.
I stuck this under 1.2 for now, but feel free to move it around if you feel that timeline isn't necessary / doesn't make sense.
Running some experiments on

At the time of this screenshot, actual disk usage is starting to drop below 4.4GB (with a peak of 7.7GB sometime earlier). The GC queue is making steady progress, though it's not overly effective since it has to do too much work per key to really breeze through. On the plus side though, the system seems stable. I saw one or two context timeouts in the logs but nothing else. Take that with a grain of salt though; I'm not running any load and it's single-node. The node maxes out my laptop's CPU (2 cores times 2 with hyperthreading). I would declare this (particular experiment) "stable but much slower than I'd want it to be". I'm prototyping some improvements and will report back. This is also running with a small RocksDB cache size, which might make some difference. I'll leave it small though.
I restarted the node before the operation finished. I can see the schema change continue, but it seems to have been restarted from the beginning (as in, we're dropping lots of chunks that are probably already gone). Since we're also aggressively pausing in this process (I think 30s, judging by the traces), it'll take an hour plus until we actually continue to remove data. That is also unfortunate, and we may have to address it sooner rather than later.
I ran an experiment that looks promising. Essentially, instead of going through MVCC, we slap RocksDB deletion tombstones on each range. The code is in https://github.com/tschottdorf/cockroach/tree/experiment/fastdrops. Instead of sending chunks of
Probably some more concerns I missed.
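To make the range-tombstone approach above concrete, here is a minimal C++ sketch against the public RocksDB API; it is not the code from the experiment branch, and the key bounds and write options are placeholders:

```cpp
#include <cassert>

#include <rocksdb/db.h>

// Sketch only: clear an entire key span by writing a single RocksDB range
// tombstone instead of issuing per-key MVCC deletions. The bounds passed in
// stand in for a range's start and end keys.
void DropKeySpan(rocksdb::DB* db, const rocksdb::Slice& start,
                 const rocksdb::Slice& end) {
  rocksdb::WriteOptions wo;
  wo.sync = true;  // assumption: make the tombstone durable before declaring success
  rocksdb::Status s = db->DeleteRange(wo, db->DefaultColumnFamily(), start, end);
  assert(s.ok());
}
```

Note that the tombstone only hides the data; the disk space is actually reclaimed once compactions rewrite the affected SSTables, which is why the compaction discussion below still matters.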
As a follow-up to this I spoke to @petermattis, @tschottdorf and @jordanlewis, and everyone is cool with running the fast-path deletion directly through a

In the future we will consider adding in a Trash basket or something which will postpone these kinds of drops to allow querying through
+cc @dianasaur323 w.r.t. incremental backup and our point-in-time recovery story
Just for the record, I am OK with changing the way the deletion is run, not the point in time at which it is run. That is to say,
There were multiple conversations in multiple venues yesterday. My final words (in some PR) were that we should stick with the existing TTL in the short term and use
Did some weighing of our options regarding ways of getting the disk space freed. (Looks like we'll need this, though there's no need to tie it into the implementation of the

Essentially there is a choice between

CompactRange

To use

SuggestCompactRange

This is an experimental RocksDB command that "suggests" a key range to RocksDB's compaction scheduler. Unfortunately, it leaves the bottom-most SSTables alone (where typically you have most of the data), so it alone wouldn't succeed in releasing most of the space. This is discussed in facebook/rocksdb#1974 and it's the reason [mongo-rocks] has a separate compaction queue (that forces compactions via

There doesn't seem to be any difficulty in adding an option to that experimental API to change that behavior, though. What's nice about

A more exotic not-quite-option: DeleteFilesInRange

There is the command [DeleteFilesInRange] mentioned in this article and also used in [mongo-rocks]. It seems attractive at first glance: all SSTs containing no keys outside the given key range are simply thrown away. Unfortunately, this has bad interaction with snapshots and a distinct lack of atomicity with respect to the

I still think we could apply this idea but it seems fairly involved. Some roadblocks include that this can't be done range-by-range (as most ranges don't fully contain a 128MB bottom-level SSTable), so this RPC needs to hit the store and basically tell it to run
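For illustration, here is a hedged C++ sketch showing the three RocksDB entry points weighed above side by side; the key bounds are placeholders, error handling is elided, and a real implementation would pick exactly one of them:

```cpp
#include <rocksdb/convenience.h>   // DeleteFilesInRange
#include <rocksdb/db.h>
#include <rocksdb/experimental.h>  // SuggestCompactRange

// Illustrative only: the three candidate APIs, with the trade-offs from the
// comment above noted inline. A real implementation would choose one.
void FreeSpaceOptions(rocksdb::DB* db, const rocksdb::Slice& begin,
                      const rocksdb::Slice& end) {
  // Option 1: CompactRange synchronously rewrites the span through all levels,
  // which reclaims space but competes with foreground work.
  rocksdb::CompactRangeOptions cro;
  db->CompactRange(cro, &begin, &end);

  // Option 2: SuggestCompactRange is experimental and only hints the compaction
  // scheduler; as discussed, it leaves the bottom-most SSTables alone.
  rocksdb::experimental::SuggestCompactRange(db, &begin, &end);

  // Option 3: DeleteFilesInRange throws away whole SSTs fully contained in the
  // span; fast, but it interacts badly with snapshots as noted above.
  rocksdb::DeleteFilesInRange(db, db->DefaultColumnFamily(), &begin, &end);
}
```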
For DROP/TRUNCATE, we also have the problem that we leave empty ranges around forever. We could handle this by merging away the empty ranges after the deletion, but I wonder if it would make sense to do this in the other order: edit the metadata descriptors to orphan the ranges to be deleted, then delete them through the replica GC path (which already uses ClearRange) instead of KV operations. Now that I think about it, this probably isn't a great idea because it's not very reusable in other contexts (while a fast ClearRange and merging of empty ranges would be good for other reasons), but I wanted to put it on the table.
Re: triggering RocksDB compactions, I think we can get a decent enough signal using some combination of

For example, if

That said, I think we should hold off on this for the time being until we have a better signal on whether it's needed (and whether
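The specific metrics referenced in the comment above were lost in formatting, so purely as an assumption, here is one shape such a signal could take in C++: compare RocksDB's estimate of live data against the on-disk size of the live SSTs and treat a large gap as "worth compacting".

```cpp
#include <cstdint>
#include <string>
#include <vector>

#include <rocksdb/db.h>

// Hypothetical heuristic: if the live SST files are much larger than RocksDB's
// estimate of the live data they contain, assume a forced compaction would
// reclaim a meaningful amount of disk space.
bool CompactionLooksWorthwhile(rocksdb::DB* db, double waste_ratio_threshold) {
  std::string estimate_str;
  if (!db->GetProperty("rocksdb.estimate-live-data-size", &estimate_str)) {
    return false;
  }
  const uint64_t live_estimate = std::stoull(estimate_str);

  std::vector<rocksdb::LiveFileMetaData> files;
  db->GetLiveFilesMetaData(&files);
  uint64_t on_disk = 0;
  for (const auto& f : files) on_disk += f.size;

  if (on_disk == 0 || on_disk <= live_estimate) {
    return false;
  }
  const double waste = static_cast<double>(on_disk - live_estimate) /
                       static_cast<double>(on_disk);
  return waste > waste_ratio_threshold;
}
```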
@spencerkimball Is there anything left to do here now that we have the compaction queue?
It certainly doesn't yet work well, but I think we can close this issue, as the remaining work is tracked in others, such as #24029.
When a table is truncated or deleted in order to free up space, we also need a way to force garbage collection and a RocksDB compaction. Currently this takes 24h, which will likely be a frustrating experience for users who are close to the space limit.