storage: implement a queue for suggested compactions #20607
Conversation
Ignore the first commit. It's just the fast drop path. Also, this needs unittests, but no point in writing them if this direction doesn't pan out. I have tested this on a local database with a couple hundred MiB, and it worked fine (and freed up the disk space too fast to even really notice).
This looks pretty good! I think we want to be conservative about

Review status: 0 of 38 files reviewed at latest revision, 26 unresolved discussions, some commit checks failed.

c-deps/libroach/db.cc, line 1677 at r2 (raw file):
NYC, but this block is so large that I think it'd look better after an inversion:

if (max_level != db->rep->NumberLevels() - 1) {
  // There are no sstables at the lowest level, so just compact the
  // entire database. Due to the level_compaction_dynamic_level_bytes
  // setting, this will only happen on very small databases.
  return ToDBStatus(db->rep->CompactRange(options, NULL, NULL));
}
// all the nontrivial code

I'm also doubtful that you really want to update

c-deps/libroach/db.cc, line 1750 at r2 (raw file):
I think it's premature to use pkg/keys/keys.go, line 65 at r2 (raw file):
Does this computation make sense? The ascending encoding is escaping-based, isn't it? The buffer size here makes it look like it's length-encoded (which it seems you could get away with here, but it's not worth it). pkg/keys/printer.go, line 195 at r2 (raw file):
Not that it'll ever matter, but you can print pkg/keys/printer_test.go, line 51 at r2 (raw file):
only a nit, but you could use backticks instead of quotes and then you don't have to escape quotes within the string: https://play.golang.org/p/_lBx1kzjfF pkg/storage/engine/compactor.go, line 15 at r2 (raw file):
Make this a new package pkg/storage/engine/compactor.go, line 39 at r2 (raw file):
pkg/storage/engine/compactor.go, line 80 at r2 (raw file):
Shouldn't this be pkg/storage/engine/compactor.go, line 93 at r2 (raw file):
Doubt this should be pkg/storage/engine/compactor.go, line 95 at r2 (raw file):
pkg/storage/engine/compactor.go, line 103 at r2 (raw file):
if err != nil {
  log.Warningf(..) // (unless you'd ever expect to see this during normal operations)
} else if !ok {
  break
}

pkg/storage/engine/compactor.go, line 109 at r2 (raw file):
Move this line up, so that it's also hit in the pkg/storage/engine/compactor.go, line 124 at r2 (raw file):
Make sure to open a trace at the caller ( pkg/storage/engine/compactor.go, line 130 at r2 (raw file):
FWIW, I think you should add the required method to the pkg/storage/engine/compactor.go, line 136 at r2 (raw file):
to manage

type compaction struct {
  enginepb.Compaction
  StartKey, EndKey roachpb.Key
}

and populate them in this loop.

pkg/storage/engine/compactor.go, line 156 at r2 (raw file):
pkg/storage/engine/compactor.go, line 168 at r2 (raw file):
This is kinda hard to read, perhaps factor it out:

shouldProcess := totalBytes >= thresholdBytes ||
  totalBytes >= int64(float64(capacity.Used)*thresholdBytesFraction) ||
  timeutil.Since(lastProcessed) >= thresholdTimeSinceLastProcess
if !shouldProcess {
  log.Infof(ctx, "skipping compaction with %d suggestions, total=%db, used=%db, last processed=%s",
    len(suggestions), totalBytes, capacity.Used, lastProcessed)
  return false, nil
}

pkg/storage/engine/compactor.go, line 175 at r2 (raw file):
Remove. pkg/storage/engine/compactor.go, line 176 at r2 (raw file):
I'd log after having processed a compaction, and include the duration and its index (and make that the only log message emitted by the compactor in the common case):

tBegin := timeutil.Now()
// work
log.Infof(ctx, "processed compaction #%d/%d for %s in %s",
  i+1, len(suggestions), humanizeutil.IBytes(sc.Bytes), timeutil.Since(tBegin))

pkg/storage/engine/compactor.go, line 206 at r2 (raw file):
These can go. pkg/storage/engine/compactor.go, line 230 at r2 (raw file):
This would pick up pkg/storage/engine/compactor.go, line 246 at r2 (raw file):
but I'm not sure this should be fatal error. You would just unset the pkg/storage/engine/compactor.go, line 253 at r2 (raw file):
I would happily make that a warning. pkg/storage/engine/enginepb/compact.proto, line 23 at r2 (raw file):
This proto is stored at a key that encodes both start key and end key, so I don't think they need to contain the keys a second time. pkg/storage/engine/enginepb/compact.proto, line 23 at r2 (raw file):
nit: no trailing dot. pkg/storage/engine/enginepb/compact.proto, line 28 at r2 (raw file):
Won't be needed without Comments from Reviewable |
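As a small aside on the printer_test.go backticks nit above, here is a minimal, self-contained illustration of the difference between interpreted and raw string literals in Go (the sample strings are made up for illustration):

package main

import "fmt"

func main() {
	// Interpreted string literal: inner quotes must be escaped.
	escaped := "compaction over [\"a\", \"b\")"
	// Raw string literal (backticks): no escaping needed, which is handy
	// for expected-output strings in table-driven tests.
	raw := `compaction over ["a", "b")`
	fmt.Println(escaped == raw) // true
}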
Reviewed 1 of 23 files at r1, 21 of 21 files at r2. c-deps/libroach/db.cc, line 1750 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
+1. Shouldn't rocksdb be able to figure this out itself when it compacts the range? pkg/roachpb/api.proto, line 260 at r2 (raw file):
Why is this different from ... OK, after finishing the rest of the review I see what you're trying to do here. Maybe call it pkg/sql/tablewriter.go, line 822 at r2 (raw file):
If this request gets split up by the DistSender, the header span will be truncated but this one won't. That could mean that we suggest a compaction for parts of the table that haven't been covered by a ClearRange yet. This is definitely unsafe with DeleteFilesInRange; with CompactRange instead it may be inefficient but at least it's not broken. pkg/storage/engine/compactor.go, line 39 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Why not pkg/storage/engine/compactor.go, line 47 at r2 (raw file):
Maybe we should compute this fraction using capacity.LogicalBytes instead of capacity.Used so we're doing apples-to-apples math. Otherwise, be clearer about how we're mixing units here: "Note that the numerator of this fraction is based on uncompressed data while the denominator is prefix- and snappy-compressed..." pkg/storage/engine/compactor.go, line 52 at r2 (raw file):
The relationship between the two time parameters is not clear from these comments. This whole process feels unnecessarily complex and reliant on magic numbers. I'm not sure that thresholdTimeSinceLastProcess makes sense - why should dropping a tiny table force a compaction after 2 hours (which I think implies rewriting at least one 128MB sstable)? If the table is too small, leave it for rocksdb to process naturally. pkg/storage/engine/compactor.go, line 107 at r2 (raw file):
This could race with a concurrent suggestion. We need to clear the timer before going into processSuggestions, so that any suggestions that arrive will queue up a new event. pkg/storage/engine/compactor.go, line 248 at r2 (raw file):
Isn't this double-counting? If the same compaction is suggested twice it should free up the same data. (after finishing the review, I see that I was wrong here, but it shows that this is surprising and needs a comment) pkg/storage/engine/enginepb/compact.proto, line 28 at r2 (raw file):
Is this preparing for the future? It looks like it's always set to true today. Comments from Reviewable |
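A minimal sketch of the timer-clearing pattern requested above for compactor.go line 107, using a one-slot channel as the pending flag; this is purely illustrative and not the actual compactor code. The point is that the signal is consumed before processing starts, so a suggestion that arrives mid-pass re-arms the channel and guarantees another pass:

package main

import (
	"fmt"
	"time"
)

func main() {
	// pending holds at most one "suggestions waiting" signal.
	pending := make(chan struct{}, 1)

	suggest := func() {
		select {
		case pending <- struct{}{}: // arm a processing pass
		default: // a pass is already queued; nothing to do
		}
	}

	go func() {
		for range pending {
			// The signal was drained before we started working, so a
			// suggestion added while this pass runs schedules a new one.
			fmt.Println("processing suggestions")
			time.Sleep(10 * time.Millisecond)
		}
	}()

	suggest()
	suggest() // may arrive while the first pass is running
	time.Sleep(100 * time.Millisecond)
}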
Reviewed 23 of 23 files at r1. c-deps/libroach/db.cc, line 1750 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
That's a good question, I wonder what RocksDB will make of this. It would ideally make this about as efficient as calling pkg/sql/tablewriter.go, line 822 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Good point. Yet more evidence that we shouldn't be BTW, a straightforward fix for this is to issue a second pkg/storage/engine/compactor.go, line 39 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
:transcendentbrain: pkg/storage/engine/compactor.go, line 52 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I agree, though I think to trust that process we should catch up on when RocksDB actually schedules compactions. It's not clear to me, and we've definitely seen it leave databases that would shrink from 15G to 100M alone for at least a day. Comments from Reviewable |
Review status: all files reviewed at latest revision, 33 unresolved discussions, some commit checks failed. pkg/storage/engine/compactor.go, line 52 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Maybe we should introduce a CompactRange RPC (and admin command, which would work online instead of the current offline compaction command) to let you force a compaction whenever rocksdb gets it wrong, even if it's not tied to a table drop. In fact, maybe the schema changer could use that RPC instead of doing all this scheduling triggered from the ClearRange RPC. Comments from Reviewable |
Review status: all files reviewed at latest revision, 33 unresolved discussions, some commit checks failed. pkg/storage/engine/compactor.go, line 52 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I think that makes sense, though that will require making Comments from Reviewable |
OK, removed use of Review status: 13 of 45 files reviewed at latest revision, 33 unresolved discussions. c-deps/libroach/db.cc, line 1677 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
@petermattis? I've changed it to use start and end key. c-deps/libroach/db.cc, line 1750 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Will look into it, but for now removing pkg/keys/keys.go, line 65 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I don't believe that the computation makes sense. I'll just let this inefficiently re-allocate as necessary instead. pkg/keys/printer.go, line 195 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
No longer applicable. pkg/keys/printer_test.go, line 51 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/roachpb/api.proto, line 260 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Removed this. The idea was to have the full table range for allowing pkg/sql/tablewriter.go, line 822 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
It was not intended to be truncated. The idea was that I wanted to be able to call pkg/storage/engine/enginepb/compact.proto, line 23 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Removed. pkg/storage/engine/enginepb/compact.proto, line 23 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/engine/enginepb/compact.proto, line 28 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I'm still going to send this as a hint. It's pertinent and I have a feeling it will come in useful. pkg/storage/engine/enginepb/compact.proto, line 28 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Suggested pkg/storage/engine/compactor.go, line 15 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/engine/compactor.go, line 39 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/engine/compactor.go, line 47 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Using logical bytes instead. pkg/storage/engine/compactor.go, line 52 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I've changed things so that the size-related thresholds activate if any contiguous collection of suggested compaction spans exceeds them, and dropped the time-based metric. pkg/storage/engine/compactor.go, line 80 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/engine/compactor.go, line 93 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
for testing...will remove. pkg/storage/engine/compactor.go, line 95 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/engine/compactor.go, line 103 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/engine/compactor.go, line 107 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done. pkg/storage/engine/compactor.go, line 109 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/engine/compactor.go, line 124 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/engine/compactor.go, line 130 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/engine/compactor.go, line 136 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Needed this message struct anyway in pkg/storage/engine/compactor.go, line 156 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/engine/compactor.go, line 168 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
All those infos were just to see what was going on when testing. Removed. pkg/storage/engine/compactor.go, line 175 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/engine/compactor.go, line 176 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/engine/compactor.go, line 206 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/engine/compactor.go, line 230 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/engine/compactor.go, line 246 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Now just using the most recent setting for the cleared bit. pkg/storage/engine/compactor.go, line 248 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
It shouldn't be. These are only meant to be suggested at times when we're committing a change to the underlying storage engine that actually clears bytes, and those are accounted for in the pkg/storage/engine/compactor.go, line 253 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. Comments from Reviewable |
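To make the aggregation idea mentioned above (thresholds activating once a contiguous collection of suggested compaction spans is large enough) more concrete, here is a rough sketch; every name and threshold is illustrative and not the PR's actual compactor implementation:

package main

import (
	"bytes"
	"fmt"
	"sort"
)

// suggestion is a simplified stand-in for a suggested compaction.
type suggestion struct {
	start, end []byte
	bytes      int64
}

// aggregate merges suggestions whose key spans touch or overlap.
func aggregate(sugs []suggestion) []suggestion {
	sort.Slice(sugs, func(i, j int) bool {
		return bytes.Compare(sugs[i].start, sugs[j].start) < 0
	})
	var out []suggestion
	for _, s := range sugs {
		if n := len(out); n > 0 && bytes.Compare(s.start, out[n-1].end) <= 0 {
			if bytes.Compare(s.end, out[n-1].end) > 0 {
				out[n-1].end = s.end
			}
			out[n-1].bytes += s.bytes
			continue
		}
		out = append(out, s)
	}
	return out
}

func main() {
	const thresholdBytes = 128 << 20 // illustrative absolute threshold
	sugs := []suggestion{
		{[]byte("a"), []byte("c"), 100 << 20},
		{[]byte("c"), []byte("f"), 60 << 20},
		{[]byte("x"), []byte("z"), 1 << 20},
	}
	for _, agg := range aggregate(sugs) {
		if agg.bytes >= thresholdBytes {
			fmt.Printf("would compact [%s,%s): %d bytes\n", agg.start, agg.end, agg.bytes)
		}
	}
}

Only the first two spans merge into one aggregate large enough to process; the isolated small span stays queued.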
c-deps/libroach/db.cc (outdated):

@@ -17,6 +17,7 @@
 #include <google/protobuf/stubs/stringprintf.h>
 #include <mutex>
 #include <rocksdb/cache.h>
+#include <rocksdb/convenience.h>
No longer need this import
I only looked at the C++ bits. Mostly looks good. Reviewed 23 of 23 files at r1. c-deps/libroach/db.cc, line 1636 at r4 (raw file):
I'm mildly surprised you have to explicitly cast to c-deps/libroach/db.cc, line 1653 at r4 (raw file):
Using bottom-most level compactions for c-deps/libroach/db.cc, line 1665 at r4 (raw file):
c-deps/libroach/include/libroach.h, line 109 at r4 (raw file):
Let's reword this to not mention GetApproximateSize so the reader doesn't have to know what that method is about. Something like: Comments from Reviewable |
Review status: 13 of 45 files reviewed at latest revision, 36 unresolved discussions, some commit checks failed. c-deps/libroach/db.cc, line 1636 at r4 (raw file): Previously, petermattis (Peter Mattis) wrote…
Done. Looks like I didn't need the explicit casts. Removed. c-deps/libroach/db.cc, line 1653 at r4 (raw file): Previously, petermattis (Peter Mattis) wrote…
As mentioned in offline conversation, the hope is that we send in huge spans after dropping a table – the compactor makes an effort to group together contiguous or near-contiguous key spans from the queued suggested compactions. Now the question is, do we want to dynamically avoid using c-deps/libroach/db.cc, line 1665 at r4 (raw file): Previously, petermattis (Peter Mattis) wrote…
Done. c-deps/libroach/include/libroach.h, line 109 at r4 (raw file): Previously, petermattis (Peter Mattis) wrote…
Done. Comments from Reviewable |
Reviewed 12 of 23 files at r1, 11 of 30 files at r3, 21 of 21 files at r4, 2 of 2 files at r5. c-deps/libroach/db.cc, line 1652 at r4 (raw file):
The "biggest reason" part of this comment is no longer true. pkg/storage/batcheval/cmd_clear_range.go, line 49 at r4 (raw file):
Not just for stats updating: If the span is less than the entire range, we must not update the GCThreshold to the current time (and as a consequence, should we also fall back to MVCC-safe DeleteRange, or use ClearRange without any safeguards against later reads?) pkg/storage/compactor/compactor.go, line 43 at r4 (raw file):
This should probably be a cluster setting. I'm concerned that this is a new component that can run amok and having a quick way to more or less disable it would be a good idea. We should also consider debuggability for any new component like this: there should be metrics and ideally a debug page about its activity. Any monitoring work we don't do now will be debt that we'll eventually have to repay. pkg/storage/engine/compactor.go, line 52 at r2 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
What do you think about the CompactRange RPC suggestion? I think that would be a simpler way of reclaiming space after dropping a table, and I think I'm more comfortable with that than a new "moving piece" like this compaction queue. This queue is potentially more generalizable to things like reclaiming space after MVCC GC, but it's not yet clear to me how exactly that would work and whether we'll be doing it any time soon. Comments from Reviewable |
Review status: all files reviewed at latest revision, 36 unresolved discussions, some commit checks failed. c-deps/libroach/db.cc, line 1677 at r2 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
Using c-deps/libroach/db.cc, line 1636 at r4 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
@tschottdorf Is adding an almost identical function in one of his PRs. You two should coordinate. c-deps/libroach/db.cc, line 1653 at r4 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
My instinct is that c-deps/libroach/db.cc, line 1690 at r5 (raw file):
Ditto for c-deps/libroach/include/libroach.h, line 109 at r4 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
See my other comment regarding @tschottdorf's addition of a similar function. Comments from Reviewable |
Review status: all files reviewed at latest revision, 36 unresolved discussions, some commit checks failed. c-deps/libroach/db.cc, line 1677 at r2 (raw file): Previously, petermattis (Peter Mattis) wrote…
@petermattis can you elaborate? Is that because looking at any key span will essentially give you at least one SSTable at the bottom level? c-deps/libroach/db.cc, line 1636 at r4 (raw file): Previously, petermattis (Peter Mattis) wrote…
I just merged mine. Sorry about the rebase, but should be straightforward. c-deps/libroach/db.cc, line 1653 at r4 (raw file): Previously, petermattis (Peter Mattis) wrote…
@petermattis can you elaborate on the strategy here (ideally in comment form so that we can put it into the code)? What I hope to understand is that if you have the key range corresponding to a bottom-most SSTable (one of the steps in I guess that's easy enough to test by running a Comments from Reviewable |
I didn't deep-dive into the code, but I agree with @bdarnell that adding a Reviewed 30 of 30 files at r3, 19 of 21 files at r4, 2 of 2 files at r5. pkg/storage/batcheval/cmd_clear_range.go, line 49 at r4 (raw file): Previously, bdarnell (Ben Darnell) wrote…
With either of those, I'm worried that we might end up creating large proposals -- think above the Raft threshold size -- and so we need chunking, etc, which isn't available at this level. You're worried only about the case in which the table hasn't been split off from its sibling(s), right? I wonder if we can delay the drop until that has happened. But that's additional complexity... What if we returned an error here that would force the schema changer into the slow path? This would work well if the error occurs on the last range: the schema changer would retry, but all of the ranges are already empty, so many of the Third stream-of-consciousness-idea: we make "partial" I like that last idea because it's straightforward and should work well in practice. pkg/storage/compactor/compactor.go, line 43 at r4 (raw file): Previously, bdarnell (Ben Darnell) wrote…
+1 to metrics. Also generally +1 to avoiding amok, though we're starting to accumulate the large heap of knobs that we once thought we'd avoid. Perhaps it's time to think about separating out knobs we want users to stumble upon, and other "never really touch" knobs. I'd really like to be able to test a component like this sufficiently to be confident in its ability to perform well under various workloads. We'll hopefully get there in the coming months. pkg/storage/compactor/compactor.go, line 38 at r5 (raw file):
pkg/storage/compactor/compactor.go, line 46 at r5 (raw file):
pkg/storage/compactor/compactor.go, line 51 at r5 (raw file):
pkg/storage/compactor/compactor.go, line 264 at r5 (raw file):
This feels over-engineered and makes me less confident that this will work equally well in the multitude of scenarios out there. Just merge key spans and run compactions on the connected components. pkg/storage/engine/enginepb/compact.proto, line 28 at r2 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
Useful for what? When you need it, you can just introduce it. Until then, it raises questions on how to merge these spans and what setting the flag does. It does nothing and there are no plans for that to change, so better to remove it. pkg/storage/engine/compactor.go, line 52 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I agree with Ben that it seems more appropriate at this point to issue an RPC. Comments from Reviewable |
I still believe this is the right approach. See below for my response to Ben's suggestion. Review status: 8 of 23 files reviewed at latest revision, 18 unresolved discussions. c-deps/libroach/db.cc, line 1677 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
@petermattis, yes please elaborate... c-deps/libroach/db.cc, line 1636 at r4 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. c-deps/libroach/db.cc, line 1652 at r4 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done. c-deps/libroach/db.cc, line 1653 at r4 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Yes, I'm also not clear on this. My understanding is that if we don't force the bottommost range, we won't clear up space for old tables after dropping them. Could you clarify? c-deps/libroach/db.cc, line 1690 at r5 (raw file): Previously, petermattis (Peter Mattis) wrote…
Done. c-deps/libroach/include/libroach.h, line 109 at r4 (raw file): Previously, petermattis (Peter Mattis) wrote…
Done. pkg/storage/batcheval/cmd_clear_range.go, line 49 at r4 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I think it's best to simply not update the GCThreshold if we run into the partial range case. This is supposed to be safe, according to the contract which this API call explicitly requires: that the key span being cleared is not to be read or written subsequent to the call. The update of the GCThreshold is just an attempt to get back errors which we can track down if someone down the road changes something in SQL land that breaks the contract. pkg/storage/compactor/compactor.go, line 43 at r4 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I added an environment variable so we can disable it. I've also added metrics, though I think a debug page is too much for this PR. pkg/storage/compactor/compactor.go, line 38 at r5 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/compactor/compactor.go, line 46 at r5 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/compactor/compactor.go, line 51 at r5 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/compactor/compactor.go, line 264 at r5 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
But we don't know for sure things are connected because they often don't overlap. For example, dropping a large table with lots of nodes will surely connect few cleared ranges – we really need to take a look at what's between the gaps. This seems reasonable to me. Would you like to suggest a smaller threshold than 64M? I chose that because that's the amount we feel confident writing to disk and moving around on the network. Seems reasonable that we'd be willing to rewrite that amount in order to compact and reclaim 128M. pkg/storage/engine/enginepb/compact.proto, line 28 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
OK pkg/storage/engine/compactor.go, line 52 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
A CompactRange RPC has drawbacks which made me choose this approach instead.
The compactor has complexity (i.e. another goroutine, and merging of contiguous suggestions), but I think the CompactRange RPC alternative(s) have both complexity (i.e. some combination of: new RPC, new Raft Command, pseudo-queue, debouncing, all-nodes-commands, etc.) and less flexibility to handle space reclamation in scenarios which will clearly impact our users (e.g. migration, manual or automatic large-scale row deletion from SQL). Comments from Reviewable |
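Regarding the environment-variable kill switch mentioned above, a minimal sketch of how that gate could look, assuming the existing envutil helper; the variable name is only a guess, not necessarily what the PR uses:

package compactor

import "github.com/cockroachdb/cockroach/pkg/util/envutil"

// compactorEnabled gates the whole compactor. It defaults to on, but can be
// switched off via the environment if the compactor ever runs amok.
// (The environment variable name here is illustrative.)
var compactorEnabled = envutil.EnvOrDefaultBool("COCKROACH_COMPACTOR_ENABLED", true)

func enabled() bool { return compactorEnabled }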
Review status: 7 of 22 files reviewed at latest revision, 17 unresolved discussions, some commit checks failed. c-deps/libroach/db.cc, line 1677 at r2 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
We specify Does that answer your questions? Perhaps I'm not understanding the concern here. c-deps/libroach/db.cc, line 1653 at r4 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
Re-reading the RocksDB doc comments, I'm actually not sure what will happen if we don't force a bottom-most compaction. If you're concerned about freeing the space, then forcing a bottom-most compaction seems safest. Might be worth experimenting and then documenting whether it is necessary or not. c-deps/libroach/db.cc, line 1690 at r5 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
Did you forget to push this change? Comments from Reviewable |
I added another comment to that discussion. TL;DR is that I still think the RPC is the right approach (at least for our upcoming release). The queue might win out in the long run, but there's not a strong case for introducing it now as far as I can tell. Review status: 7 of 22 files reviewed at latest revision, 13 unresolved discussions, some commit checks failed. c-deps/libroach/db.cc, line 1677 at r2 (raw file): Previously, petermattis (Peter Mattis) wrote…
I see why I'm confused: I thought the c-deps/libroach/db.cc, line 1694 at r6 (raw file):
s/entire database/specified keyspan wholesale/ c-deps/libroach/db.cc, line 1695 at r6 (raw file):
pkg/storage/batcheval/cmd_clear_range.go, line 49 at r4 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
I'd much prefer if this this call weren't the first to have a new set of assumptions that need to hold for it to be safe. It seems that we can avoid it (see option three), so I think we should (and in turn remove the assumptions from the comments). pkg/storage/engine/compactor.go, line 52 at r2 (raw file): There are some good points there, but I'm skeptical. Consider the following instead:
I agree that there is some general benefit to the compactor queue in the long run (and when it becomes truly useful, we can change what PS: I didn't understand what you mean with
Comments from Reviewable |
Thank you for the review. I remain convinced this is the right approach...especially given that you agree already this will eventually be necessary to cleanup after aggressive GC activity. I will continue making my case: Your suggested approach works for the case of

The overall goal here is straightforward: CockroachDB should reclaim disk space within reasonable operator expectations for timing, in situations where they'll be expecting an RDBMS to do so. So far, we've failed on this score and it's left egg on our faces. It's of course not reasonable or possible to do the most efficient implementations right off the bat – and it's certainly disappointing that RocksDB doesn't do a better job of scheduling these itself. However, this PR is the foundation for a comprehensive solution to this problem. I don't believe it makes sense to introduce a solution which targets only 1/3 of the identified problem areas, and wait for the remainder to bite us before we must fix them for v2.1 (e.g. handle DROP TABLE, but ignore rebalancing and large-scale GC). That's another six months where we have to explain to people that they need to wait for RocksDB to background compact.

Here are the user stories, to be explicit about it:

"I just followed the demo on the website for migration between clouds, but now both my nodes in GCP and AWS are using significant disk space – in fact, the original GCP nodes, which should be mostly empty, are using more disk space than the AWS nodes. If I migrate a table off of GCP, why doesn't it free up the disk space? The admin UI shows the data has moved but when I

"I set my TTL for table

When we have a solution that handles all three cases, why choose to fix just one case? It sounds like we're worried about complexity. Let me address that here.
To gauge the danger of this approach (what if the compactor runs wild?), I've done a small experiment to ascertain the performance hit on the system while asking RocksDB to compact full-range key spans, one at a time, across a 200MiB+ database, all ranges with less than 64MiB of logical bytes, all forcing compaction to the bottommost level. Note the impact on KV benchmark with 95% read (note the arrow below shows where the compact range loop starts (1 minute into process):
And for 100% write the performance hit is more pronounced:
The good news is that RocksDB apparently doesn't naively rewrite SSTables on

The bottom line is that these three situations where space-reclamation should occur can be handled by this mechanism in CockroachDB v2.0, and I don't think we should settle for a solution which handles only one of three.

Review status: 7 of 22 files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.

c-deps/libroach/db.cc, line 1653 at r4 (raw file): Previously, petermattis (Peter Mattis) wrote…
Will experiment with this. c-deps/libroach/db.cc, line 1690 at r5 (raw file): Previously, petermattis (Peter Mattis) wrote…
Didn't change it...I only changed the previous one. Fixed. c-deps/libroach/db.cc, line 1694 at r6 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. c-deps/libroach/db.cc, line 1695 at r6 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/batcheval/cmd_clear_range.go, line 49 at r4 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Option three doesn't allow us to remove the assumptions from this method's API.proto comments. You absolutely cannot read or write concurrently, or read after, from this span when Clearing a span which is just a subset of a range is exactly as (un)safe as clearing the whole range. Why should we fall back to the slow path? If it makes things more palatable for you, I could just remove setting of the GC threshold entirely. It's a fig leaf. pkg/storage/engine/compactor.go, line 52 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Hopefully my comment at the top addresses your suggestion. I think what you've proposed here is elegant and at least slightly simpler than what I'm doing. But it doesn't set us up to solve two other major problems with our current behavior which I'm resolved to address for v2.0. I would like to put the larger problem here out of its misery. I further expect that the three areas I've identified will not be the only three. Comments from Reviewable |
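For reference, a sketch of the kind of experiment loop described in the comment above (compacting each range's key span one at a time and spreading the calls out). The CompactRange signature and the span source are assumptions for illustration, not the engine API added in this PR:

package experiment

import (
	"log"
	"time"
)

// compacter captures just the call the experiment needs.
type compacter interface {
	CompactRange(start, end []byte, forceBottommost bool) error
}

// compactAllSpans compacts each span one at a time, pausing between calls so
// the impact on foreground traffic can be observed.
func compactAllSpans(eng compacter, spans [][2][]byte) error {
	for i, sp := range spans {
		if err := eng.CompactRange(sp[0], sp[1], true /* force bottommost */); err != nil {
			return err
		}
		log.Printf("compacted span %d/%d", i+1, len(spans))
		time.Sleep(time.Second) // spread the load out a little
	}
	return nil
}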
My main concern is that this is over-engineered for the DROP TABLE case, so if we're moving ahead with broader use of this queue, I'm OK with it. But what's being dropped to make room for all of this in 2.0? Reviewed 14 of 24 files at r6, 2 of 2 files at r7. pkg/storage/batcheval/cmd_clear_range.go, line 49 at r4 (raw file):
In addition to the desire to use this on sub-range parts of tables, I think we might want to stop auto-splitting all tables (until we can merge empty ranges).
Yeah, I'd rather stop setting this. If it's safe to not set it for partial range operations, it's safe to not set it for whole range operations. pkg/storage/compactor/compactor.go, line 270 at r7 (raw file):
Basing this on the newest rather than the oldest could result in an endlessly-growing set of suggestions (which is one kind of "running amok", not just endless compactions). pkg/storage/compactor/compactor.go, line 287 at r7 (raw file):
Use ClearIterRange instead of ClearRange. The range tombstones left by ClearRange have a cost. (Feel free to rename these methods to indicate that ClearIterRange is the preferred default) pkg/storage/compactor/compactor.go, line 369 at r7 (raw file):
There's no double-counting, but it's racy. We may end up over-counting the bytes available to reclaim in a span we just compacted. Something else to watch out for. pkg/storage/compactor/compactor_test.go, line 63 at r7 (raw file):
This test looks unfinished. Comments from Reviewable |
@spencerkimball thanks for running the experiments. Curious how bigger tables turn out. re: the user stories: I don't dispute that in the long run we probably want to do more than just run a compaction when a table is dropped. I can see fairly directly how this would work after replica GC (the cloud migration case), and the queue seems like a good fit for that. It's much less straightforward to make this work for compactions triggered by the GC queue. For the drop table case though, your solution seems strictly worse than the RPC. For example, if a node has two replicas of a range with lots of empty space in between (for which it used to have replicas, i.e. there's lots of data), you won't compact them in one go. That suggests that even with the compactor, we may still want to introduce the RPC and issue it after the That said, since you're hellbent on this anyway and I can't spend more cycles pushing for more pragmatism, here's my suggestion wishlist:
I'm also slightly worried about long keys in the compactor. For example, if you have a large table with multi-mb primary keys (and thus potentially multi-mb split keys), and drop that table, you get a lot of compactions (with multi-mb keys) and the compactor will pull them all into memory (I think the right solution here is to avoid long split keys, though). Review status: all files reviewed at latest revision, 22 unresolved discussions, some commit checks failed. pkg/storage/batcheval/cmd_clear_range.go, line 49 at r4 (raw file):
From the SQL standpoint, yes. But it retained the KV transactional guarantees (on a per-range basis) by preventing incorrect reads/writes. I agree with you both though that there seems to be no salvaging that in a meaningful way due to the partial range issue. Let's remove the fig leaf. pkg/storage/compactor/compactor.go, line 44 at r7 (raw file):
For the rebalancing case, I'm not sure 2min will cut it. For a large table, when replicas are removed in random order, you'll have to wait a while to see contiguous patches. Just something to watch out for. Perhaps being 100% optimal isn't what we're going for anyway. pkg/storage/compactor/compactor.go, line 49 at r7 (raw file):
Mention that this is tuned for our L6 target SSTable size. pkg/storage/compactor/compactor.go, line 54 at r7 (raw file):
I'm not sure this is useful. If it is, could you explain when? pkg/storage/compactor/compactor.go, line 59 at r7 (raw file):
Explain the rationale behind this number. Isn't 128MB (our L6 SSTable target) a better heuristic? Also, using That would also make me feel less concerned about the heuristic in general and would remove this ad-hoc number plus relying on pkg/storage/compactor/compactor.go, line 64 at r7 (raw file):
This is tuned to the default GC TTL, but with slow enough churn (and GC queue hints) you'd never compact. I know this isn't a goal in this PR, but it is somehow your goal in introducing this mechanism in the first place (admittedly it'll work well for the replicaGCQueue), and so I'm curious to understand how you plan to make this work well with GCQueue suggestions. Not a blocker, but ISTM that a better place to store the suggestions would be in a range-local key that is managed by the GCQueue, where it would send a compaction hint only when it's reasonably sure that a compaction is now warranted (as opposed to putting stuff in this compactor early and letting it endlessly deal with it). But even that is kinda frail, because RocksDB runs its own compactions too and we don't want to duplicate work. pkg/storage/compactor/compactor.go, line 77 at r7 (raw file):
I'd make this a method, for nobody should be able to mutate it:

func defaultCompactorOptions() compactorOptions {
  return ...
}

pkg/storage/compactor/compactor.go, line 204 at r7 (raw file):
do you need a pkg/storage/compactor/compactor.go, line 218 at r7 (raw file):
Weird that you need to pass pkg/storage/compactor/compactor.go, line 245 at r7 (raw file):
pass the following instead
and remove I'm not sure we need pkg/storage/compactor/compactor.go, line 252 at r7 (raw file):
is that second part ever useful? pkg/storage/compactor/compactor.go, line 263 at r7 (raw file):
Message seems off.
After my suggestion above, this reads something like

duration := timeutil.Since(startTime)
c.Metrics.CompactingNanos.Inc(int64(duration))
log.Eventf(ctx, "processed %s in %s", aggr, duration)

pkg/storage/compactor/compactor.go, line 270 at r7 (raw file): Previously, bdarnell (Ben Darnell) wrote…
It couldn't be endless because aggregating more and more increases the compaction size and it would hit the threshold. But it sure seems (at least once the GC queue gives compaction hints) that we could end up with a large amount of suggestions. Say we have 1000 replicas that are far apart and slowly accruing garbage. So every hour or so we receive 1000 compaction hints for small amounts. pkg/storage/engine/enginepb/compact.proto, line 1 at r7 (raw file):
Should move to Comments from Reviewable |
With scalefactor=1 for tpch, it compacted a dropped 1.3GiB table (
Note that I still haven't done the compactor unittests. Will get to that now that I think the design has stabilized. Review status: 8 of 19 files reviewed at latest revision, 22 unresolved discussions. pkg/storage/batcheval/cmd_clear_range.go, line 49 at r4 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/compactor/compactor.go, line 44 at r7 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Which is fine. It still has to pass the thresholds, which means we'll have to wait for enough contiguous ranges to be compactable. I've changed pkg/storage/compactor/compactor.go, line 49 at r7 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I've increased this to pkg/storage/compactor/compactor.go, line 54 at r7 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
The idea here is to make sure this feature works when folks are dropping tables in a relatively small database. This is how most users are evaluating CockroachDB. If you have 200MiB of data and you drop the whole thing, you want to see your disk space free up, believe me. This is the trigger that ensures that still happens, since the absolute threshold won't trigger. I expanded the comment. pkg/storage/compactor/compactor.go, line 59 at r7 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
That's a solid idea. I've removed this pkg/storage/compactor/compactor.go, line 64 at r7 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
So far I've really only been trying to make this PR handle the pkg/storage/compactor/compactor.go, line 77 at r7 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/compactor/compactor.go, line 204 at r7 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
We always call commit on the batch, which has the effect of closing it. Am I missing something? pkg/storage/compactor/compactor.go, line 218 at r7 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I re-jiggered this a bit based on your suggestions. pkg/storage/compactor/compactor.go, line 245 at r7 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
See comment above. We do still need capacity for the fractional threshold. pkg/storage/compactor/compactor.go, line 252 at r7 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
As I mentioned in an earlier comment, the fractional threshold is about still compacting when people have small databases, but nevertheless expect disk space to free up. pkg/storage/compactor/compactor.go, line 263 at r7 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I wanted the log messages to show how things are aggregating – good for debugging if things start acting weird. I made some changes to make it more readable and to get the indexing right. I've added a metric for time spent compacting. pkg/storage/compactor/compactor.go, line 270 at r7 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
As I mentioned in offline comments, my goal with this isn't to send micro hints to the compactor, but only to send suggestions when a range is cleared, rebalanced, or 90% of its data is GC'd. But to be cautious, I've changed it to always delete records after the max record age. pkg/storage/compactor/compactor.go, line 287 at r7 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done. pkg/storage/compactor/compactor.go, line 369 at r7 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Why is it racy? This code path will be protected at a higher level by the command queue and our replica gc machinery. Are you referring to something else? pkg/storage/compactor/compactor_test.go, line 63 at r7 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I was waiting for buy in before writing the tests. pkg/storage/engine/enginepb/compact.proto, line 1 at r7 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. Comments from Reviewable |
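Putting the pieces from the comments above together (size thresholds plus deleting records after the max record age), a rough sketch of the per-aggregate decision; the constants and names are illustrative, the real values live in the PR's compactor options:

package compactor

import "time"

// Illustrative thresholds, not the PR's actual numbers.
const (
	thresholdBytes         = 256 << 20      // absolute reclaimable bytes
	thresholdBytesFraction = 0.10           // fraction of logical bytes in use
	maxSuggestionAge       = 24 * time.Hour // discard suggestions older than this
)

// shouldCompact sketches the decision for one aggregated suggestion: process
// it if it clears either size threshold; drop it once it is older than the
// maximum record age without ever having cleared a threshold.
func shouldCompact(aggBytes, logicalBytes int64, suggestedAt time.Time) (process, discard bool) {
	if aggBytes >= thresholdBytes ||
		float64(aggBytes) >= float64(logicalBytes)*thresholdBytesFraction {
		return true, false
	}
	if time.Since(suggestedAt) > maxSuggestionAge {
		return false, true // leave it to RocksDB's background compactions
	}
	return false, false
}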
Reviewed 15 of 15 files at r8. pkg/storage/batcheval/cmd_clear_range.go, line 57 at r8 (raw file):
We can remove the GC threshold key from declareKeys. pkg/storage/compactor/compactor.go, line 54 at r7 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
For perspective, mysql/innodb in its default configuration did not return unused disk space to the system until version 5.6. There are still parts of the system that are unable to shrink without wiping the disk and starting from scratch. It's great that we're improving our ability to free up space, but remember that failing to do so is not a dealbreaker. pkg/storage/compactor/compactor.go, line 369 at r7 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
The race is between the GetProto and PutProto in this method, and the deletions in processCompaction. Suppose there is a 1MB compaction record on disk, and then a new 1MB suggestion arrives for the same range. You could have
pkg/storage/engine/rocksdb.go, line 301 at r8 (raw file):
Comments from Reviewable |
Review status: 17 of 19 files reviewed at latest revision, 22 unresolved discussions. pkg/storage/batcheval/cmd_clear_range.go, line 57 at r8 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done. pkg/storage/compactor/compactor.go, line 54 at r7 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I agree to some extent, MySQL/innodb v5.6 was two decades into the product lifecycle. However, you ultimately have to deliver according to customer expectations. We are trying to appeal to a class of customer which would not have taken earlier versions of MySQL very seriously. Not to mention that the goalposts of what an RDBMS should do have been moved substantially forward, even from 2013. pkg/storage/compactor/compactor.go, line 369 at r7 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Ah yes good point. Do you think it's worth fixing this or just a comment? pkg/storage/engine/rocksdb.go, line 301 at r8 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done. Comments from Reviewable |
Have to dig into the tests and the SSTable-spy tomorrow, but generally this is starting to look close to final. Good work! Reviewed 4 of 24 files at r6, 9 of 15 files at r8, 2 of 2 files at r9. pkg/storage/batcheval/cmd_clear_range.go, line 109 at r9 (raw file):
Just use the timestamp of the request. Taking it from the clock was necessary for the fig leaf, but that is no more. pkg/storage/compactor/compactor.go, line 49 at r7 (raw file):
pkg/storage/compactor/compactor.go, line 54 at r7 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
I'm personally not convinced by this, but perhaps you're right. If you adequately test it, no opposition on my part. pkg/storage/compactor/compactor.go, line 204 at r7 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
Probably not, I just don't know what guarantee you have if pkg/storage/compactor/compactor.go, line 220 at r9 (raw file):
nit: pkg/storage/compactor/compactor.go, line 248 at r9 (raw file):
Wouldn't you want to continue compacting "other things"? I think this would work better if pkg/storage/compactor/compactor.go, line 287 at r9 (raw file):
Pull most of the above into log.Eventf(ctx, "processing suggested compaction %s", aggr) Actually, may wanna scratch "suggested". pkg/storage/compactor/compactor.go, line 299 at r9 (raw file):
Ditto:
(the trace will already keep track of the duration, btw, but since you have it might as well print it) pkg/storage/compactor/compactor.go, line 301 at r9 (raw file):
Ditto pkg/storage/compactor/compactor.go, line 339 at r9 (raw file):
I'm confused. Why at most? If you have two key spans that touch on the SST level, wouldn't you always want to merge them, no matter how many SSTables are overlapping? Why don't you just check whether the "gap" pkg/storage/compactor/compactor.go, line 342 at r9 (raw file):
I haven't looked at that method yet which probably explains why I'm confused, but I expected pkg/storage/compactor/metrics.go, line 36 at r9 (raw file):
Could you sprinkle pkg/storage/engine/rocksdb.go, line 265 at r9 (raw file):
should pkg/storage/engine/rocksdb.go, line 267 at r9 (raw file):
nit: double dot pkg/storage/engine/rocksdb.go, line 269 at r9 (raw file):
nit: pkg/storage/engine/rocksdb.go, line 297 at r9 (raw file):
I have to look at this with fresh eyes tomorrow, but still confused about the "at most two". pkg/storage/engine/rocksdb_test.go, line 874 at r9 (raw file):
Comments please! 😄 Comments from Reviewable |
Review status: 11 of 19 files reviewed at latest revision, 25 unresolved discussions. pkg/storage/batcheval/cmd_clear_range.go, line 109 at r9 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/compactor/compactor.go, line 49 at r7 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/compactor/compactor.go, line 54 at r7 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
We've had more people query us about why space isn't freeing up on a small database than on large ones. But it's now pretty well tested. pkg/storage/compactor/compactor.go, line 204 at r7 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
My review of the code after your original comment suggests that it closes on failure. pkg/storage/compactor/compactor.go, line 220 at r9 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I removed this in favor of a new interface which adds the pkg/storage/compactor/compactor.go, line 248 at r9 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I'm logging an error now and continuing with whatever other suggested compactions may be waiting. pkg/storage/compactor/compactor.go, line 287 at r9 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/compactor/compactor.go, line 299 at r9 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/compactor/compactor.go, line 301 at r9 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/compactor/compactor.go, line 339 at r9 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Yes, this was an error I caught in my unittests. pkg/storage/compactor/compactor.go, line 342 at r9 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
pkg/storage/compactor/metrics.go, line 36 at r9 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. Good idea. pkg/storage/engine/rocksdb.go, line 265 at r9 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Size is irrelevant to the project here, since we're sorting first by start key and the sstables are disjoint at L1+. I've tried to clarify the comment. pkg/storage/engine/rocksdb.go, line 267 at r9 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/engine/rocksdb.go, line 269 at r9 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/engine/rocksdb.go, line 297 at r9 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Here's the thinking: Let's say you have only L0, L1, & L2 sstables in a small database for example purposes. And let's say they look as follows:
pkg/storage/engine/rocksdb_test.go, line 874 at r9 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Writing comments exposed a bug. Thank you. Comments from Reviewable |
Review status: 11 of 19 files reviewed at latest revision, 25 unresolved discussions, some commit checks pending. pkg/storage/engine/rocksdb.go, line 297 at r9 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
I haven't been following the developments of this PR, but this caught my eye. We should never have the situation where we have L2 sstables but not L6 sstables. See the Comments from Reviewable |
pkg/storage/engine/rocksdb.go, line 297 at r9 (raw file): Previously, petermattis (Peter Mattis) wrote…
That’s good to know, but it doesn’t change the algorithm here. Or do you think it does somehow that I’m missing? Comments from Reviewable |
Review status: 11 of 19 files reviewed at latest revision, 25 unresolved discussions, some commit checks failed. pkg/storage/engine/rocksdb.go, line 297 at r9 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
Not sure if you're missing anything, just pointing out that you should be considering gaps in the level structure when thinking through examples. Comments from Reviewable |
Only have to look at the tests. Rest LGTM mod (insubstantial) comments. Reviewed 8 of 8 files at r10. pkg/storage/compactor/compactor.go, line 342 at r9 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
I was just confused because I wasn't sure what was going on. Resolved now. pkg/storage/engine/engine.go, line 243 at r10 (raw file):
I'd prefer if you just threw this into pkg/storage/engine/rocksdb.go, line 297 at r9 (raw file):
Ideally, the example would also have empty levels, just to be more true to reality with our configuration. pkg/storage/engine/rocksdb.go, line 317 at r10 (raw file):
how about

if i >= len(tables) - 1 {
// First overlapped SSTable is the last (right-most) SSTable.
// Span: [c-----f)
// SSTs: [a---d)
// or
// SSTs: [a-----------q)
return false
}
if span.EndKey.Compare(tables[i+1].End.Key) <= 0 {
// Span does not reach outside of this SSTable's right neighbor.
// Span: [c------f)
// SSTs: [a---d) [e-f) ...
return false
}
if i >= len(tables) - 2 {
// Span reaches outside of this SSTable's right neighbor, but
// there are no more SSTables to the right.
// Span: [c-------------x)
// SSTs: [a---d) [e---q)
return false
}
if span.EndKey.Compare(tables[i+2].Start.Key) <= 0 {
// There's another SSTable two to the right, but the span doesn't
// reach into it.
// Span: [c------------x)
// SSTs: [a---d) [e---q) [x--z) ...
return false
}
// Touching at least three SSTables.
// Span: [c-------------y)
// SSTs: [a---d) [e---q) [x--z) ...
return true

Comments from Reviewable |
Reviewed 1 of 15 files at r8. pkg/storage/compactor/compactor_test.go, line 71 at r10 (raw file):
Should have some empty levels in between for realism. Just move L2 to L6 and insert some more L2 which (I think) would allow you to avoid large changes in the tests? pkg/storage/compactor/compactor_test.go, line 127 at r10 (raw file):
Was here a reason to make these pkg/storage/compactor/compactor_test.go, line 191 at r10 (raw file):
Nice touch dealing with the integer division. pkg/storage/compactor/compactor_test.go, line 305 at r10 (raw file):
s/space/gap here and above and below? pkg/storage/compactor/compactor_test.go, line 465 at r10 (raw file):
Insert a pkg/storage/compactor/compactor_test.go, line 466 at r10 (raw file):
Once you're here, shouldn't the rest always succeed? That would suggest moving it out of the pkg/storage/compactor/compactor_test.go, line 474 at r10 (raw file):
t.Fatal? Ditto below. pkg/storage/compactor/compactor_test.go, line 505 at r10 (raw file):
Could you just set pkg/storage/compactor/compactor_test.go, line 515 at r10 (raw file):
pkg/storage/compactor/compactor_test.go, line 519 at r10 (raw file):
pkg/storage/compactor/compactor_test.go, line 522 at r10 (raw file):
Shouldn't you set this before calling pkg/storage/compactor/compactor_test.go, line 571 at r10 (raw file):
use the pkg/storage/engine/rocksdb.go, line 301 at r8 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
Not done? pkg/storage/engine/rocksdb_test.go, line 869 at r10 (raw file):
Have only casually browsed this test. There's a lot of cognitive overhead to verify the results. Unfortunately, I don't have a good idea on how to improve it. Comments from Reviewable |
Review status: all files reviewed at latest revision, 26 unresolved discussions, some commit checks failed. pkg/storage/compactor/compactor.go, line 342 at r9 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Gotcha. Yeah this stuff is confusing until you pore over it long enough. Even then. pkg/storage/compactor/compactor_test.go, line 71 at r10 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Did that and it worked as expected. I moved L1 -> L2 and L2 -> L6. pkg/storage/compactor/compactor_test.go, line 127 at r10 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
No, I believe other things at the same level were and I was cargo-culting. Changed. pkg/storage/compactor/compactor_test.go, line 191 at r10 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
:-) pkg/storage/compactor/compactor_test.go, line 305 at r10 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/compactor/compactor_test.go, line 465 at r10 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Neat idea. Done. pkg/storage/compactor/compactor_test.go, line 466 at r10 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Not necessarily. All of the metrics and the actual compactions slice can be set before we manage to commit the pkg/storage/compactor/compactor_test.go, line 474 at r10 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. But not sure I understand what the ditto refers to. pkg/storage/compactor/compactor_test.go, line 505 at r10 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/compactor/compactor_test.go, line 515 at r10 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/compactor/compactor_test.go, line 519 at r10 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/compactor/compactor_test.go, line 522 at r10 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/compactor/compactor_test.go, line 571 at r10 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/engine/engine.go, line 243 at r10 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
So far we've avoided adulterating pkg/storage/engine/rocksdb.go, line 301 at r8 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I added a comment below in the last commit. I've added another before pkg/storage/engine/rocksdb.go, line 297 at r9 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/engine/rocksdb.go, line 317 at r10 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Wow...done. pkg/storage/engine/rocksdb_test.go, line 869 at r10 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I'm fairly comfortable with it. Comments from Reviewable |
Reviewed 6 of 6 files at r11. pkg/storage/compactor/compactor.go, line 393 at r11 (raw file):
Any point in the pkg/storage/compactor/compactor_test.go, line 474 at r10 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
This was in reference to moving everything here out of pkg/storage/engine/engine.go, line 243 at r10 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
I prefer the explicitness over finding out the hard way via an interface assertion panic in Comments from Reviewable |
Review status: all files reviewed at latest revision, 13 unresolved discussions, some commit checks pending. pkg/storage/compactor/compactor.go, line 393 at r11 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I was under the impression that those routines were meant to deal with raw keys specifically. Is that not the case? Or because we don't expect these kinds of keys to ever have addressing info, it's moot? pkg/storage/compactor/compactor_test.go, line 474 at r10 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Gotcha. pkg/storage/engine/engine.go, line 243 at r10 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I figure that will happen in testing when we introduce a new engine type and we'll add the right bifurcation at that point. Comments from Reviewable |
Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks failed. pkg/storage/compactor/compactor.go, line 393 at r11 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
Yeah. RKey matters only for keys that need to be routed through DistSender. Comments from Reviewable |
Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks failed. pkg/storage/compactor/compactor.go, line 393 at r11 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. Comments from Reviewable |
Clear range commands now come with an attendant suggested range compaction hint. Any suggested compactions generated during command execution are now sent via replicated result data to each replica and stored in a store-local queue of pending compaction suggestions. A new compactor goroutine runs periodically to process pending suggestions. If more than an absolute number of bytes is reclaimable, or if the bytes to reclaim exceed a threshold fraction of the total used bytes, we'll go ahead and compact the suggested range. Suggested compactions are allowed to remain in the queue for at most 24 hours, after which if they haven't been aggregated into a compact-able key span, they'll be discarded, and left to RocksDB's background compaction processing. Release note (UX improvement): When tables are dropped, the space will be reclaimed in a more timely fashion.
Clear range commands now come with an attendant suggested range compaction
hint. Any suggested compactions generated during command execution are now
sent via replicated result data to each replica and stored in a store-local
queue of pending compaction suggestions.
A new compactor goroutine runs periodically to process pending suggestions.
If more than an absolute number of bytes is reclaimable, or if the bytes
to reclaim exceed a threshold fraction of the total used bytes, or if it's
just been a threshold amount of time since the last processing, we'll go
ahead and try to compact the suggested range.
Each suggestion has a "cleared" flag, which if set, indicates that the
suggestion is for a range which will never be written to by any SQL
process (and have it succeed). If cleared is true, the compactor first
invokes rocksdb::DeleteAllFilesInRange in order to drop SSTables fast
and avoid processing them later. It then invokes rocksdb::CompactRange
to clean up the remainder.
Release note (performance improvement): When tables are dropped, the
space will be reclaimed in a more timely fashion.
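As an illustration of the two-step handling of cleared suggestions described in the commit message, a hedged Go-side sketch; the engine interface and method signatures below are assumptions standing in for the libroach calls named above, not the actual interface added in this PR:

package compactor

// clearedEngine stands in for the engine methods the compactor would need;
// names and signatures are illustrative.
type clearedEngine interface {
	// DeleteFilesInRange drops whole SSTables fully contained in the span.
	DeleteFilesInRange(start, end []byte) error
	// CompactRange compacts whatever remains in the span.
	CompactRange(start, end []byte, forceBottommost bool) error
}

// processCleared handles a suggestion whose "cleared" flag is set: the span
// will never be written again, so whole SSTables can be dropped cheaply
// before the remainder is compacted away.
func processCleared(eng clearedEngine, start, end []byte) error {
	if err := eng.DeleteFilesInRange(start, end); err != nil {
		return err
	}
	return eng.CompactRange(start, end, true /* force bottommost */)
}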