
Provide a force GC / RocksDB compaction option for freeing up space #19329

Closed
spencerkimball opened this issue Oct 18, 2017 · 19 comments

@spencerkimball
Member

After a table is truncated or deleted in order to free up space, we will also require a way to force garbage collection and RocksDB compaction. Currently this takes 24h, which will likely be a frustrating experience for users who are close to the space limit.

@spencerkimball
Member Author

+cc @garvitjuniwal

@tbg
Member

tbg commented Oct 25, 2017

Not saying that this is a sufficiently streamlined way of doing so, but couldn't you just change the zone configs to a TTL of 1s and have the data be removed much earlier than the 25h that way?

@spencerkimball
Member Author

That’s a reasonable approach. Best to use existing pathways. The devil would be in the details of the SQL syntax and whatever degree of progress reporting we’d opt for.

@a-robinson
Contributor

We'd also need to force a RocksDB compaction if the situation is really urgent.

@dt
Member

dt commented Oct 25, 2017

Exploring explicit compactions is also on the RESTORE/IMPORT backlog: on success we generate many small files, and on failure we want a way to clean up the orphaned imported data to free up space.

@tbg
Member

tbg commented Nov 15, 2017

A DROP/TRUNCATE TABLE will result in GC runs that essentially clear the whole range (bringing the stats close to zero). It seems reasonable that GC could trigger a compaction in such cases.

dianasaur323 added this to the 1.2 milestone Nov 18, 2017
@dianasaur323
Contributor

dianasaur323 commented Nov 18, 2017

I stuck this under 1.2 for now, but feel free to move it around if you feel that timeline isn't necessary / doesn't make sense.

@tbg
Member

tbg commented Dec 5, 2017

I'm running some experiments on DROP TABLE this week. Experiment number one is on current master (post most of @spencerkimball's GC changes). I imported tpch.lineitem at scale factor 5 into a single-node cluster, which occupies around 6.6GB on my disk. I then ran DROP TABLE tpch.lineitem.

[screenshot: disk usage over time]

At the time of this screenshot, actual disk usage is starting to drop below 4.4GB (with a peak of 7.7GB sometime earlier).

The GC queue is making steady progress, though it's not overly effective since it has to do too much work per key to really breeze through.

[screenshot: GC queue progress]

On the plus side though, the system seems stable. I saw one or two context timeouts in the logs but nothing else. Take that with a grain of salt though; I'm not running any load and it's single-node.

The node maxes out my laptop's CPU (2 cores times 2 (hyperthreading)).
Memory use is reasonable.

[screenshot: CPU and memory usage]

I would declare this (particular experiment) "stable but much slower than I'd want it to be". I'm prototyping some improvements and will report back.

This is also running with a small RocksDB cache size, which might make some difference. I'll leave it small though.

@tbg
Member

tbg commented Dec 5, 2017

I restarted the node before the operation finished. I can see the schema change continue, but it seems to have been restarted from the beginning (as in, we're dropping lots of chunks that are probably already gone). Since we're also pausing aggressively in this process (I think 30s, judging by the traces), it'll take an hour plus until we actually continue removing data. That is also unfortunate, and we may have to address it sooner rather than later.

@tbg
Member

tbg commented Dec 5, 2017

I ran an experiment that looks promising. Essentially, instead of going through MVCC, we slap RocksDB deletion tombstones on each range. The code is in https://github.com/tschottdorf/cockroach/tree/experiment/fastdrops. Instead of sending chunks of DeleteRange, the schema changer issues a single ClearRangeRequest encompassing the whole table. This is distributed by DistSender and, when executed, simply lays down a (RocksDB) range deletion tombstone that removes all user data. This is blazing fast (because it doesn't do any work; <1s to mark 6.6GB of on-disk data as deleted), and after a manual compaction (which takes around a minute) the data size drops to ~1MB. A rough RocksDB-level sketch of the fast path follows at the end of this comment. To productionize this, @spencerkimball and I were thinking the following (motivated by earlier brainstorming with @jordanlewis and @vivekmenezes):

  • need sql-level support to enable this fast path, tentatively DROP xyz FORCE as an initial API for this feature. This isn't powerful enough, but is straightforward to iterate on.
  • move calling ClearRange into the GC queue since that provides natural sequencing and rate limiting. This requires a mechanism for the GC queue to check whether the replica is "wipeable". @spencerkimball was considering using the existing zone config mechanism for this, details TBD.
  • the GC queue would also invoke a compaction on each such range (or at least suggest one to RocksDB).
  • MVCC stats need to be updated. I have this in the code but needed to comment it out: DistSender sends 256 requests in parallel, so the stats computations grind to a halt (multi-minute each). I think we can be more efficient here by doing some math (i.e. computing the complement of the user data, which is typically small). This would also be less horrible once the computation happens on the queue.
  • WriteBatch needs to support range deletions. I think that is straightforward (cc @petermattis).
  • need to set GCThreshold in ClearRange and deal with other shortcomings of ClearRange (it's nontransactional but multi-range, which is not something we allow today)

Probably some more concerns I missed.
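
For illustration, here's a minimal sketch of what the fast path boils down to at the RocksDB level, using the stock C++ API rather than our engine wrapper and Go plumbing (the function and key names are made up; the real path goes through ClearRangeRequest and a WriteBatch):

```cpp
#include <cassert>

#include "rocksdb/db.h"

// Sketch only: drop all data in [table_start, table_end) by writing a single
// range deletion tombstone, then compact that span so the space is actually
// reclaimed on disk. The key bounds stand in for the dropped table's key span.
void FastDropTable(rocksdb::DB* db, const rocksdb::Slice& table_start,
                   const rocksdb::Slice& table_end) {
  // Near-zero work: the tombstone hides every key in the span without
  // rewriting any data, which is why this finishes in under a second.
  rocksdb::Status s = db->DeleteRange(rocksdb::WriteOptions(),
                                      db->DefaultColumnFamily(),
                                      table_start, table_end);
  assert(s.ok());

  // Until a compaction runs, the dead data still occupies disk space. A manual
  // compaction over the same span rewrites the affected SSTables and drops it.
  rocksdb::CompactRangeOptions opts;
  opts.exclusive_manual_compaction = false;  // don't block background compactions
  s = db->CompactRange(opts, &table_start, &table_end);
  assert(s.ok());
}
```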

@vivekmenezes
Contributor

As a follow-up to this I spoke to @petermattis, @tschottdorf and @jordanlewis, and everyone is cool with running the fast-path deletion directly through a DROP TABLE without the FORCE clause.

In the future we will consider adding a trash basket or something similar, which would postpone these kinds of drops to allow querying through AS OF SYSTEM TIME as well as fast recovery within a TTL period.

@dt
Member

dt commented Dec 7, 2017

+cc @dianasaur323 w.r.t incremental backup and our point-in-time recovery story

@tbg
Member

tbg commented Dec 7, 2017

As a follow-up to this I spoke to @petermattis, @tschottdorf and @jordanlewis, and everyone is cool with running the fast-path deletion directly through a DROP TABLE without the FORCE clause.

Just for the record, I am OK with changing the way the deletion is run, not the point in time at which it is run. That is to say, DROP TABLE x would wait for the TTL and then ClearRange. I thought that was also what @petermattis agreed to yesterday, though he should speak for himself to avoid more confusion.

@petermattis
Collaborator

Just for the record, I am OK with changing the way the deletion is run, not the point in time at which it is run. That is to say, DROP TABLE x would wait for the TTL and then ClearRange. I thought that was also what @petermattis agreed to yesterday, though he should speak for himself to avoid more confusion.

There were multiple conversations in multiple venues yesterday. My final words (in some PR) were that we should stick with the existing TTL in the short term and use ClearRange. Future work could add various facilities to force a table to be dropped immediately or to adjust the TTL on an already-dropped table.

@tbg
Member

tbg commented Dec 9, 2017

I did some weighing of our options for getting the disk space freed. (Looks like we'll need this, though there's no need to tie it into the implementation of the ClearRange-DROP mechanism.)

Essentially there is a choice between CompactRange and SuggestCompactRange. CompactRange appears to interact reasonably with background compactions (as in, it's aware of them), so it should be safe to call DBCompact, though I'm not sure whether this compaction would be allowed to hog more resources than RocksDB's background compactions (in which case looking at SuggestCompactRange is worth it).

CompactRange

To use CompactRange, we introduce a Node-level command CompactRange that passes through to its Stores via a command of the same name (ending up in DBCompact, i.e. ./cockroach debug compact) and that can be triggered from SQL via an experimental API such as crdb_internal.force_compaction(<node_id>, [<table_id>]). More experimentation is needed to decide whether calling this after a successful DROP TABLE (via ClearRange) is reasonable (or whether it can disrupt the cluster).
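
To make the above concrete, here is a rough sketch (my reading, not the actual DBCompact code) of what the per-store call would bottom out in; passing null bounds compacts the whole key space, which is roughly the ./cockroach debug compact case, while passing a table's span would be the crdb_internal.force_compaction(...) case. The helper name is made up:

```cpp
#include "rocksdb/db.h"
#include "rocksdb/options.h"

// Sketch: a manual compaction of one store's RocksDB instance over [begin, end),
// or over everything if both bounds are nullptr.
rocksdb::Status CompactStore(rocksdb::DB* db,
                             const rocksdb::Slice* begin = nullptr,
                             const rocksdb::Slice* end = nullptr) {
  rocksdb::CompactRangeOptions opts;
  // Rewrite the bottommost level too; that's where most of the dropped data
  // typically lives, so skipping it would reclaim little space.
  opts.bottommost_level_compaction = rocksdb::BottommostLevelCompaction::kForce;
  // Let background compactions keep running alongside the manual one.
  opts.exclusive_manual_compaction = false;
  return db->CompactRange(opts, begin, end);
}
```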

SuggestCompactRange

This is an experimental RocksDB command that "suggests" a key range to RocksDB's compaction scheduler. Unfortunately, it leaves the bottom-most SSTables alone (where you typically have most of the data), so on its own it wouldn't succeed in releasing most of the space. This is discussed in facebook/rocksdb#1974, and it's the reason mongo-rocks has a separate compaction queue (which forces compactions via .CompactRange).

There doesn't seem to be any difficulty in adding an option to that experimental API to change that behavior, though. What's nice about SuggestCompactRange is that it'd be OK to execute it on each range as a side effect of ClearRange (all it does is mark SSTables as needing a compaction, so marking an SSTable multiple times through adjacent ranges is not a problem unless the compaction starts really quickly).
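
For reference, the experimental hook looks roughly like this in the C++ API (a sketch; the wrapper name is made up, and Cockroach would reach it through the C glue):

```cpp
#include "rocksdb/db.h"
#include "rocksdb/experimental.h"

// Sketch: mark the SSTables overlapping [begin, end) as candidates for
// compaction. This is cheap and idempotent, so issuing it once per range as a
// side effect of ClearRange would be fine, but per the caveat above it leaves
// the bottommost level (and thus most of the space) alone by default.
rocksdb::Status SuggestCompaction(rocksdb::DB* db, const rocksdb::Slice& begin,
                                  const rocksdb::Slice& end) {
  return rocksdb::experimental::SuggestCompactRange(db, &begin, &end);
}
```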

A more exotic not-quite-option:

DeleteFilesInRange

There is the command DeleteFilesInRange mentioned in this article and also used in mongo-rocks. It seems attractive at first glance: all SSTs containing no keys outside the given key range are simply thrown away. Unfortunately, it has a bad interaction with snapshots and a distinct lack of atomicity with respect to the WriteBatch of the Raft command in which this command would be applied.

I still think we could apply this idea but it seems fairly involved. Some roadblocks include that this can't be done range-by-range (as most ranges don't fully contain a 128MB bottom-level SSTable), so this RPC needs to hit the store and basically tell it to run DeleteFilesInRange for the given span against all of its engines. This means that the Replicas have to be in a special state so that they don't freak out (consistency checker, not running, no open snapshots, etc). Also, we'd have to (before the deletion) send ClearRanges to the ranges and extract a promise that there won't be any write activity on the userspace keys we're going to drop (and bump the GCThreshold to avoid reads). Not worth looking at more at the moment.
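
For completeness, the call itself is a one-liner (sketch; the helper name is made up); the hard part is all of the choreography described above:

```cpp
#include "rocksdb/convenience.h"
#include "rocksdb/db.h"

// Sketch: physically delete every SST file that lies entirely inside
// [begin, end). Files straddling the bounds are left alone, and the call
// ignores snapshots, which is why it needs the careful setup described above
// before it could be used safely.
rocksdb::Status DropWholeFiles(rocksdb::DB* db, const rocksdb::Slice& begin,
                               const rocksdb::Slice& end) {
  return rocksdb::DeleteFilesInRange(db, db->DefaultColumnFamily(), &begin, &end);
}
```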

@bdarnell
Contributor

For DROP/TRUNCATE, we also have the problem that we leave empty ranges around forever. We could handle this by merging away the empty ranges after the deletion, but I wonder if it would make sense to do this in the other order: Edit the metadata descriptors to orphan the ranges to be deleted, then delete them through the replica GC path (which already uses ClearRange) instead of KV operations.

Now that I think about it, this probably isn't a great idea because it's not very reusable in other contexts (while a fast ClearRange and merging of empty ranges would be good for other reasons), but I wanted to put it on the table.

@tbg
Member

tbg commented Dec 11, 2017

Re: triggering RocksDB compactions, I think we can get a decent enough signal using some combination of as (GetApproximateSize(from, to)) and ms (MVCCStats), but it's going to be a little ad-hoc.

For example, if ms.Total() = ms.KeyBytes + ms.ValBytes <= fudge * as, it's likely that much can be compacted away. fudge here accounts for the fact that perfectly compacted on-disk data is likely much smaller than ms.Total(). I think this is mostly prefix compression + snappy, and I'm not sure what the smallest realistic fudge is. Unfortunately it'll depend on how much of the data is in the key, and how effective prefix compression is. I imagine if you take a table that has a single string as a PK but mostly the inserted strings differ only at the very end, prefix compression will be very effective. The fudge factor could be computed on a per-table basis, though, taking that into account.
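
A sketch of that heuristic (the helper name and the MVCCStats plumbing are made up; GetApproximateSizes is the RocksDB call, used here via its simple default-column-family overload):

```cpp
#include <cstdint>

#include "rocksdb/db.h"

// Sketch of the heuristic above: mvcc_total stands in for
// ms.KeyBytes + ms.ValBytes from the range's MVCCStats, while the on-disk
// estimate comes from RocksDB. If the logical size is small relative to what
// sits on disk (modulo a fudge factor for prefix compression + snappy), a
// compaction of the span is likely to reclaim space.
bool ShouldSuggestCompaction(rocksdb::DB* db, const rocksdb::Slice& start,
                             const rocksdb::Slice& end, uint64_t mvcc_total,
                             double fudge) {
  rocksdb::Range r(start, end);
  uint64_t approx_disk_size = 0;
  db->GetApproximateSizes(&r, 1, &approx_disk_size);
  // ms.Total() <= fudge * as, as in the comment above.
  return static_cast<double>(mvcc_total) <=
         fudge * static_cast<double>(approx_disk_size);
}
```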

That said, I think we should hold off on this for the time being until we have a better signal on whether it's needed (and whether GetApproximateSize actually behaves as I think).

bdarnell modified the milestones: 2.0, 2.1 Feb 8, 2018
@petermattis
Collaborator

@spencerkimball Is there anything left to do here now that we have the compaction queue?

@tbg
Member

tbg commented Mar 28, 2018

It certainly doesn't yet work well, but I think we can close this issue as the remaining work is tracked in others, such as #24029.

tbg closed this as completed Mar 28, 2018