storage: implement a queue for suggested compactions #20607

Merged · 1 commit · Dec 24, 2017

Conversation

spencerkimball
Member

Clear range commands now come with an attendant suggested range compaction
hint. Any suggested compactions generated during command execution are now
sent via replicated result data to each replica and stored in a store-local
queue of pending compaction suggestions.

A new compactor goroutine runs periodically to process pending suggestions.
If more than an absolute number of bytes is reclaimable, or if the bytes
to reclaim exceed a threshold fraction of the total used bytes, or if it's
just been a threshold amount of time since the last processing, we'll go
ahead and try to compact the suggested range.

Each suggestion has a "cleared" flag, which if set, indicates that the
suggestion is for a range which will never be written to by any SQL
process (and have it succeed). If cleared is true, the compactor first
invokes rocksdb::DeleteAllFilesInRange in order to drop SSTables fast
and avoid processing them later. It then invokes rocksdb::CompactRange
to clean up the remainder.

Release note (performance improvement): When tables are dropped, the
space will be reclaimed in a more timely fashion.
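For orientation, a minimal Go sketch of the trigger logic described above; the constant names and values here are illustrative stand-ins, not this PR's identifiers:

package compactor

import "time"

// shouldProcess reports whether the queued suggestions warrant a compaction
// run, per the three conditions above: absolute reclaimable bytes, fraction
// of used bytes, or elapsed time since the last run.
func shouldProcess(totalBytes, usedBytes int64, lastProcessed time.Time) bool {
	const thresholdBytes = int64(256 << 20) // illustrative absolute threshold
	const thresholdFraction = 0.05          // illustrative fraction of used bytes
	const thresholdSinceLast = 2 * time.Hour // illustrative max time between runs
	return totalBytes >= thresholdBytes ||
		float64(totalBytes) >= thresholdFraction*float64(usedBytes) ||
		time.Since(lastProcessed) >= thresholdSinceLast
}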

@spencerkimball requested review from tbg and a team — December 11, 2017 02:39

@spencerkimball
Member Author

Ignore the first commit. It's just the fast drop path.

Also, this needs unit tests, but there's no point in writing them if this direction doesn't pan out. I have tested this on a local database with a couple hundred MiB, and it worked fine (and freed up the disk space too fast to even really notice).

@tbg
Member

tbg commented Dec 11, 2017

This looks pretty good! I think we want to be conservative about DeleteFilesInRange, though. The semantics are wildly different and I don't think we need to lean quite as far out the window for the next release. I'd be fine leaving the code in, though, just not enabled by default (also not as a cluster setting; an env var would be OK).


Review status: 0 of 38 files reviewed at latest revision, 26 unresolved discussions, some commit checks failed.


c-deps/libroach/db.cc, line 1677 at r2 (raw file):

  all_metadata.clear();

  if (max_level == db->rep->NumberLevels() - 1) {

NYC, but this block is so large that I think it'd look better after an inversion:

if (max_level != db->rep->NumberLevels() - 1) {
  // There are no sstables at the lowest level, so just compact the
  // entire database. Due to the level_compaction_dynamic_level_bytes
  // setting, this will only happen on very small databases.
  return ToDBStatus(db->rep->CompactRange(options, NULL, NULL));
}

// all the nontrivial code

I'm also doubtful that you really want to update max_level only for the key ranges you're interested in. If I pass a key range, I'd never want a full compaction. Or perhaps you need to just pass start_key and end_key instead of NULL here. @petermattis will know best which one it is.


c-deps/libroach/db.cc, line 1750 at r2 (raw file):

}

DBStatus DBDeleteFilesInRange(DBEngine* db, DBKey start, DBKey end) {

I think it's premature to use DeleteFilesInRange, for the reasons outlined in #19329 (comment).


pkg/keys/keys.go, line 65 at r2 (raw file):

// suggested compaction. It combines the specified start and end keys.
func StoreSuggestedCompactionKey(start, end roachpb.RKey) roachpb.Key {
	detail := make(roachpb.RKey, 0, len(start)+1+len(end)+1)

Does this computation make sense? The ascending encoding is escaping-based, isn't it? The buffer size here makes it look like it's length-encoded (which it seems you could get away with here, but it's not worth it).
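For illustration, a minimal sketch of the escape-based encoding in question, assuming pkg/util/encoding's EncodeBytesAscending and omitting the store-local prefix the real function prepends:

package keys

import (
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/encoding"
)

// suggestedCompactionKeyDetail sketches the detail portion only.
// EncodeBytesAscending escapes 0x00 as 0x00 0xff and appends a 0x00 0x01
// terminator, so the encoded form can exceed len(key)+1 and a buffer
// pre-sized that way may reallocate anyway.
func suggestedCompactionKeyDetail(start, end roachpb.RKey) roachpb.Key {
	var detail []byte
	detail = encoding.EncodeBytesAscending(detail, start)
	detail = encoding.EncodeBytesAscending(detail, end)
	return roachpb.Key(detail)
}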


pkg/keys/printer.go, line 195 at r2 (raw file):

	key, end, err = encoding.DecodeBytesAscending(key, nil)
	if err != nil {
		return fmt.Sprintf("<invalid range: %s>", err)

Not that it'll ever matter, but you can print start here in this message.


pkg/keys/printer_test.go, line 51 at r2 (raw file):

		{StoreClusterVersionKey(), "/Local/Store/clusterVersion"},
		{StoreSuggestedCompactionKey(roachpb.RKey(MinKey), roachpb.RKey("b")), "/Local/Store/suggestedCompaction/{/Min-\"b\"}"},
		{StoreSuggestedCompactionKey(roachpb.RKey("a"), roachpb.RKey("b")), "/Local/Store/suggestedCompaction/{\"a\"-\"b\"}"},

only a nit, but you could use backticks instead of quotes and then you don't have to escape quotes within the string: https://play.golang.org/p/_lBx1kzjfF
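For illustration, a tiny runnable example of the raw-string form:

package main

import "fmt"

func main() {
	// The double-quoted form needs escaped quotes; the backtick form does not.
	escaped := "/Local/Store/suggestedCompaction/{\"a\"-\"b\"}"
	raw := `/Local/Store/suggestedCompaction/{"a"-"b"}`
	fmt.Println(escaped == raw) // true
}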


pkg/storage/engine/compactor.go, line 15 at r2 (raw file):

// permissions and limitations under the License.

package engine

Make this a new package engine/compactor


pkg/storage/engine/compactor.go, line 39 at r2 (raw file):

120*time.Second

super nit: this feels slightly yoda.


pkg/storage/engine/compactor.go, line 80 at r2 (raw file):

		ctx, keys.LocalStoreSuggestedCompactionsMin, keys.LocalStoreSuggestedCompactionsMax,
	); err != nil {
		log.ErrEventf(ctx, "failed check whether compaction suggestions exist: %s", err)

Shouldn't this be log.Warningf?


pkg/storage/engine/compactor.go, line 93 at r2 (raw file):

			case <-c.ch:
				// Set the wait timer if not already set.
				log.Infof(ctx, "woken up to process suggestions")

Doubt this should be Infof.


pkg/storage/engine/compactor.go, line 95 at r2 (raw file):

				log.Infof(ctx, "woken up to process suggestions")
				if !timerSet {
					timer.Reset(compactionMinInterval)

timerSet = true


pkg/storage/engine/compactor.go, line 103 at r2 (raw file):

				ok, err := c.processSuggestions(ctx, lastProcessed)
				if !ok || err != nil {
					log.ErrEventf(ctx, "failed processing suggested compactions: %s", err)

Suggestion:

if err != nil {
  log.Warningf(..) // (unless you'd ever expect to see this during normal operations)
} else if !ok {
  break
}

pkg/storage/engine/compactor.go, line 109 at r2 (raw file):

				// No need for the timer; everything has been processed. Wait
				// for the next suggestion before resetting the timer.
				lastProcessed = timeutil.Now()

Move this line up, so that it's also hit in the !ok branch (but not the err != nil one).
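Pulling the loop-related comments together, a sketch of the resulting shape; Compactor, processSuggestions, and compactionMinInterval are stand-ins for the PR's code, and unarming the timer before processing also anticipates the timer-race note later in this review:

package compactor

import (
	"context"
	"log"
	"time"
)

const compactionMinInterval = 2 * time.Minute

// Compactor is a stand-in for the PR's type; ch is pinged by SuggestCompaction.
type Compactor struct {
	ch chan struct{}
}

// processSuggestions is a stub standing in for the real processing; it would
// return ok=false when nothing met the thresholds.
func (c *Compactor) processSuggestions(ctx context.Context, lastProcessed time.Time) (bool, error) {
	return true, nil
}

func (c *Compactor) loop(ctx context.Context) {
	var lastProcessed time.Time
	timer := time.NewTimer(time.Hour)
	if !timer.Stop() {
		<-timer.C // start unarmed
	}
	var timerSet bool
	for {
		select {
		case <-c.ch:
			// A suggestion arrived; arm the timer once. Further pings
			// before it fires are no-ops.
			if !timerSet {
				timer.Reset(compactionMinInterval)
				timerSet = true
			}
		case <-timer.C:
			// Mark the timer unarmed *before* processing so a suggestion
			// arriving mid-processing re-arms it rather than being lost.
			timerSet = false
			if _, err := c.processSuggestions(ctx, lastProcessed); err != nil {
				log.Printf("failed processing suggested compactions: %s", err)
				continue // don't advance lastProcessed on error
			}
			// Advanced in both the processed and below-threshold (!ok) cases.
			lastProcessed = time.Now()
		case <-ctx.Done():
			return
		}
	}
}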


pkg/storage/engine/compactor.go, line 124 at r2 (raw file):

// been sufficiently long since the last successful processing.
// Returns a boolean indicating whether the processing occurred.
func (c *Compactor) processSuggestions(

Make sure to open a trace at the caller (tracing.EnsureContext).


pkg/storage/engine/compactor.go, line 130 at r2 (raw file):

	rocksdb, ok := c.eng.(*RocksDB)
	if !ok {
		return false, fmt.Errorf("cannot process suggestions on engine of type %T", c.eng)

FWIW, I think you should add the required method to the Engine interface and make it a no-op for anything that doesn't support it.
I'm pretty sure we don't want to do DeleteFilesInRange just yet.


pkg/storage/engine/compactor.go, line 136 at r2 (raw file):

	var suggestions []enginepb.Compaction
	var totalBytes int64
	if err := c.eng.Iterate(

to manage Key and EndKey after removing them from the proto, add

type compaction struct {
  enginepb.Compaction
  StartKey, EndKey roachpb.Key
}

and populate them in this loop.


pkg/storage/engine/compactor.go, line 156 at r2 (raw file):

a compaction


pkg/storage/engine/compactor.go, line 168 at r2 (raw file):

		log.Eventf(ctx, "skipping compaction with %d suggestions, total=%db, used=%db, last processed=%s",
			len(suggestions), totalBytes, capacity.Used, lastProcessed)
		log.Infof(ctx, "skipping compaction with %d suggestions, total=%db, used=%db, last processed=%s",

Infof implies Eventf, so you don't need both.

This is kinda hard to read, perhaps factor it out:

shouldProcess := totalBytes >= thresholdBytes || totalBytes >= int64(float64(capacity.Used)*thresholdBytesFraction) || timeutil.Since(lastProcessed) >= thresholdTimeSinceLastProcess

if !shouldProcess {
  log.Infof(ctx, "skipping compaction with %d suggestions, total=%db, used=%db, last processed=%s",
			len(suggestions), totalBytes, capacity.Used, lastProcessed)
  return false, nil
}

pkg/storage/engine/compactor.go, line 175 at r2 (raw file):

	// Process each suggestion in turn.
	for _, sc := range suggestions {
		log.VEventf(ctx, 2, "processing suggested compaction %+v", sc)

Remove.


pkg/storage/engine/compactor.go, line 176 at r2 (raw file):

	for _, sc := range suggestions {
		log.VEventf(ctx, 2, "processing suggested compaction %+v", sc)
		log.Infof(ctx, "processing suggested compaction %+v", sc)

I'd log after having processed a compaction, and include the duration and its index (and make that the only log message emitted by the compactor in the common case):

tBegin := timeutil.Now()
// work
log.Infof(ctx, "processed compaction #%d/%d for %s in %s", i+1, len(suggestions), humanizeutil.IBytes(sc.Bytes), timeutil.Since(tBegin))

pkg/storage/engine/compactor.go, line 206 at r2 (raw file):

	log.Eventf(ctx, "processed all %d compaction(s) successfully", len(suggestions))
	log.Infof(ctx, "processed all %d compaction(s) successfully", len(suggestions))

These can go.


pkg/storage/engine/compactor.go, line 230 at r2 (raw file):

// SuggestCompaction writes the specified compaction to persistent
// storage and pings the processing goroutine.
func (c *Compactor) SuggestCompaction(ctx context.Context, sc enginepb.Compaction) {

This would pick up startKey, endKey roachpb.Key.


pkg/storage/engine/compactor.go, line 246 at r2 (raw file):

	if ok {
		if existing.Cleared && !sc.Cleared {
			panic("existing suggested compaction was cleared, but new is not cleared")

log.Fatalf(ctx, "existing compaction %+v incompatible with suggested compaction %+v", existing, sc)

but I'm not sure this should be a fatal error. You would just unset the Cleared bit. We know that for the table drop use case this wouldn't ordinarily happen, but that doesn't mean it's illegal to re-use the key space.
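A sketch of the non-fatal merge behavior being suggested here (Compaction below mirrors only the relevant proto fields):

// Compaction mirrors the Bytes and Cleared fields of the proto.
type Compaction struct {
	Bytes   int64
	Cleared bool
}

// mergeSuggestion folds a new suggestion into an existing one for the same
// span. Bytes are summed rather than deduped: each suggestion corresponds to
// a write that cleared distinct bytes. The newest Cleared value wins; reusing
// the key space is legal, so a mismatch is not fatal.
func mergeSuggestion(existing, sc Compaction) Compaction {
	sc.Bytes += existing.Bytes
	return sc
}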


pkg/storage/engine/compactor.go, line 253 at r2 (raw file):

	// Store the new compaction.
	if _, _, err = PutProto(c.eng, MVCCKey{Key: key}, &sc); err != nil {
		log.ErrEventf(ctx, "unable to record suggested compaction: %s", err)

I would happily make that a warning.


pkg/storage/engine/enginepb/compact.proto, line 23 at r2 (raw file):

  option (gogoproto.equal) = true;

  bytes start_key = 1; // cannot specify roachpb.Key without creating import loop.

This proto is stored at a key that encodes both start key and end key, so I don't think they need to contain the keys a second time.


pkg/storage/engine/enginepb/compact.proto, line 23 at r2 (raw file):

  option (gogoproto.equal) = true;

  bytes start_key = 1; // cannot specify roachpb.Key without creating import loop.

nit: no trailing dot.


pkg/storage/engine/enginepb/compact.proto, line 28 at r2 (raw file):

  // deleted, and will not be re-used. This allows the compactor to
  // more efficiently reclaim space.
  bool cleared = 3;

Won't be needed without DeleteFilesInRange.



@bdarnell
Contributor

Reviewed 1 of 23 files at r1, 21 of 21 files at r2.
Review status: 22 of 38 files reviewed at latest revision, 33 unresolved discussions, some commit checks failed.


c-deps/libroach/db.cc, line 1750 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I think it's premature to use DeleteFilesInRange, for the reasons outlined in #19329 (comment).

+1. Shouldn't rocksdb be able to figure this out itself when it compacts the range?


pkg/roachpb/api.proto, line 260 at r2 (raw file):

  util.hlc.Timestamp gc_threshold = 2 [(gogoproto.nullable) = false,
      (gogoproto.customname) = "GCThreshold"];
  Span suggested_compaction = 3 [(gogoproto.nullable) = false];

Why is this different from header?

...

OK, after finishing the rest of the review I see what you're trying to do here. Maybe call it entire_span at this level (with comments) and then cmd_clear_range.go turns it into a "compaction"


pkg/sql/tablewriter.go, line 822 at r2 (raw file):

		},
		GCThreshold:         hlc.Timestamp{WallTime: td.tableDesc().GCDeadline},
		SuggestedCompaction: tableSpan,

If this request gets split up by the DistSender, the header span will be truncated but this one won't. That could mean that we suggest a compaction for parts of the table that haven't been covered by a ClearRange yet. This is definitely unsafe with DeleteFilesInRange; with CompactRange instead it may be inefficient but at least it's not broken.


pkg/storage/engine/compactor.go, line 39 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

super nit: this feels slightly yoda.

120*time.Second

Why not 2*time.Minute with no comment?


pkg/storage/engine/compactor.go, line 47 at r2 (raw file):

	// thresholdBytesFraction is the fraction of total bytes used which
	// are up for suggested reclamation, after which the compactor will
	// begin processing. Note that because data is prefix- and snappy-

Maybe we should compute this fraction using capacity.LogicalBytes instead of capacity.Used so we're doing apples-to-apples math. Otherwise, be clearer about how we're mixing units here: "Note that the numerator of this fraction is based on uncompressed data while the denominator is prefix- and snappy-compressed..."


pkg/storage/engine/compactor.go, line 52 at r2 (raw file):

	thresholdBytesFraction = 0.05 // more than 5% of space will trigger

	// thresholdTimeSinceLastProcess is the maximum amount of time since

The relationship between the two time parameters is not clear from these comments.

This whole process feels unnecessarily complex and reliant on magic numbers. I'm not sure that thresholdTimeSinceLastProcess makes sense - why should dropping a tiny table force a compaction after 2 hours (which I think implies rewriting at least one 128MB sstable)? If the table is too small, leave it for rocksdb to process naturally.


pkg/storage/engine/compactor.go, line 107 at r2 (raw file):

					break
				}
				// No need for the timer; everything has been processed. Wait

This could race with a concurrent suggestion. We need to clear the timer before going into processSuggestions, so that any suggestions that arrive will queue up a new event.


pkg/storage/engine/compactor.go, line 248 at r2 (raw file):

			panic("existing suggested compaction was cleared, but new is not cleared")
		}
		sc.Bytes += existing.Bytes

Isn't this double-counting? If the same compaction is suggested twice it should free up the same data.

(after finishing the review, I see that I was wrong here, but it shows that this is surprising and needs a comment)


pkg/storage/engine/enginepb/compact.proto, line 28 at r2 (raw file):

  // deleted, and will not be re-used. This allows the compactor to
  // more efficiently reclaim space.
  bool cleared = 3;

Is this preparing for the future? It looks like it's always set to true today.



@tbg
Member

tbg commented Dec 11, 2017

Reviewed 23 of 23 files at r1.
Review status: all files reviewed at latest revision, 33 unresolved discussions, some commit checks failed.


c-deps/libroach/db.cc, line 1750 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

+1. Shouldn't rocksdb be able to figure this out itself when it compacts the range?

That's a good question, I wonder what RocksDB will make of this. It would ideally make this about as efficient as calling DeleteFilesInRange itself but we should check.


pkg/sql/tablewriter.go, line 822 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

If this request gets split up by the DistSender, the header span will be truncated but this one won't. That could mean that we suggest a compaction for parts of the table that haven't been covered by a ClearRange yet. This is definitely unsafe with DeleteFilesInRange; with CompactRange instead it may be inefficient but at least it's not broken.

Good point. Yet more evidence that we shouldn't be DeleteFilesInRangeing yet.

BTW, a straightforward fix for this is to issue a second ClearRange separately which has the Clear field set but only hits the start key of the table span.


pkg/storage/engine/compactor.go, line 39 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Why not 2*time.Minute with no comment?

:transcendentbrain:


pkg/storage/engine/compactor.go, line 52 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The relationship between the two time parameters is not clear from these comments.

This whole process feels unnecessarily complex and reliant on magic numbers. I'm not sure that thresholdTimeSinceLastProcess makes sense - why should dropping a tiny table force a compaction after 2 hours (which I think implies rewriting at least one 128MB sstable)? If the table is too small, leave it for rocksdb to process naturally.

I agree, though I think to trust that process we should catch up on when RocksDB actually schedules compactions. It's not clear to me, and we've definitely seen it leave databases that would shrink from 15G to 100M alone for at least a day.



@bdarnell
Contributor

Review status: all files reviewed at latest revision, 33 unresolved discussions, some commit checks failed.


pkg/storage/engine/compactor.go, line 52 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I agree, though I think to trust that process we should catch up on when RocksDB actually schedules compactions. It's not clear to me, and we've definitely seen it leave databases that would shrink from 15G to 100M alone for at least a day.

Maybe we should introduce a CompactRange RPC (and admin command, which would work online instead of the current offline compaction command) to let you force a compaction whenever rocksdb gets it wrong, even if it's not tied to a table drop. In fact, maybe the schema changer could use that RPC instead of doing all this scheduling triggered from the ClearRange RPC.



@tbg
Member

tbg commented Dec 11, 2017

Review status: all files reviewed at latest revision, 33 unresolved discussions, some commit checks failed.


pkg/storage/engine/compactor.go, line 52 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Maybe we should introduce a CompactRange RPC (and admin command, which would work online instead of the current offline compaction command) to let you force a compaction whenever rocksdb gets it wrong, even if it's not tied to a table drop. In fact, maybe the schema changer could use that RPC instead of doing all this scheduling triggered from the ClearRange RPC.

I think that makes sense, though that will require making ClearRange accumulate the stats (of the affected bytes) so that they can be passed into the compactor. (Straightforward, but wanted to mention it).



@spencerkimball
Member Author

OK, removed use of DeleteFilesInRange.


Review status: 13 of 45 files reviewed at latest revision, 33 unresolved discussions.


c-deps/libroach/db.cc, line 1677 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

NYC, but this block is so large that I think it'd look better after an inversion:

if (max_level != db->rep->NumberLevels() - 1) {
  // There are no sstables at the lowest level, so just compact the
  // entire database. Due to the level_compaction_dynamic_level_bytes
  // setting, this will only happen on very small databases.
  return ToDBStatus(db->rep->CompactRange(options, NULL, NULL));
}

// all the nontrivial code

I'm also doubtful that you really want to update max_level only for the key ranges you're interested in. If I pass a key range, I'd never want a full compaction. Or perhaps you need to just pass start_key and end_key instead of NULL here. @petermattis will know best which one it is.

@petermattis? I've changed it to use start and end key.


c-deps/libroach/db.cc, line 1750 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

That's a good question, I wonder what RocksDB will make of this. It would ideally make this about as efficient as calling DeleteFilesInRange itself but we should check.

Will look into it, but for now removing DeleteFilesInRange.


pkg/keys/keys.go, line 65 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Does this computation make sense? The ascending encoding is escaping-based, isn't it? The buffer size here makes it look like it's length-encoded (which it seems you could get away with here, but it's not worth it).

I don't believe that the computation makes sense. I'll just let this inefficiently re-allocate as necessary instead.


pkg/keys/printer.go, line 195 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Not that it'll ever matter, but you can print start here in this message.

No longer applicable.


pkg/keys/printer_test.go, line 51 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

only a nit, but you could use backticks instead of quotes and then you don't have to escape quotes within the string: https://play.golang.org/p/_lBx1kzjfF

Done.


pkg/roachpb/api.proto, line 260 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Why is this different from header?

...

OK, after finishing the rest of the review I see what you're trying to do here. Maybe call it entire_span at this level (with comments) and then cmd_clear_range.go turns it into a "compaction"

Removed this. The idea was to have the full table range for allowing DeleteFilesInRange a chance to do its work. Now that I've stopped using that, this no longer makes sense.


pkg/sql/tablewriter.go, line 822 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Good point. Yet more evidence that we shouldn't be DeleteFilesInRangeing yet.

BTW, a straightforward fix for this is to issue a second ClearRange separately which has the Clear field set but only hits the start key of the table span.

It was not intended to be truncated. The idea was that I wanted to be able to call DeleteFilesInRange on a span locally on RocksDB which might encompass many former ranges of data. The idea was to be able to drop sstables like flies when you drop a 6GiB table. Without sending the full table span from this level, there would be no way to reliably piece it together at the store level. It's a moot point now.


pkg/storage/engine/enginepb/compact.proto, line 23 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

This proto is stored at a key that encodes both start key and end key, so I don't think they need to contain the keys a second time.

Removed.


pkg/storage/engine/enginepb/compact.proto, line 23 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

nit: no trailing dot.

Done.


pkg/storage/engine/enginepb/compact.proto, line 28 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Won't be needed without DeleteFilesInRange.

I'm still going to send this as a hint. It's pertinent and I have a feeling it will come in useful.


pkg/storage/engine/enginepb/compact.proto, line 28 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Is this preparing for the future? It looks like it's always set to true today.

Suggested Compaction messages will be sent when we GC enough older versions, when we rebalance a range, etc. Some clear all of the data (even some GCs over more traditional DeleteRange spans for example), while others may just free up enough old version cruft to warrant a compaction.


pkg/storage/engine/compactor.go, line 15 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Make this a new package engine/compactor

Done.


pkg/storage/engine/compactor.go, line 39 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

:transcendentbrain:

Done.


pkg/storage/engine/compactor.go, line 47 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Maybe we should compute this fraction using capacity.LogicalBytes instead of capacity.Used so we're doing apples-to-apples math. Otherwise, be clearer about how we're mixing units here: "Note that the numerator of this fraction is based on uncompressed data while the denominator is prefix- and snappy-compressed..."

Using logical bytes instead.


pkg/storage/engine/compactor.go, line 52 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I think that makes sense, though that will require making ClearRange accumulate the stats (of the affected bytes) so that they can be passed into the compactor. (Straightforward, but wanted to mention it).

I've changed things so that the size-related thresholds activate if any contiguous collection of suggested compaction spans exceeds them, and dropped the time-based metric.


pkg/storage/engine/compactor.go, line 80 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Shouldn't this be log.Warningf?

Done.


pkg/storage/engine/compactor.go, line 93 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Doubt this should be Infof.

for testing...will remove.


pkg/storage/engine/compactor.go, line 95 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

timerSet = true

Done.


pkg/storage/engine/compactor.go, line 103 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…
if err != nil {
  log.Warningf(..) //(unless you'd ever expect to see this during normal operations)
} else if !ok {
  break
}

Done.


pkg/storage/engine/compactor.go, line 107 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This could race with a concurrent suggestion. We need to clear the timer before going into processSuggestions, so that any suggestions that arrive will queue up a new event.

Done.


pkg/storage/engine/compactor.go, line 109 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Move this line up, so that it's also hit in the !ok branch (but not the err != nil one).

Done.


pkg/storage/engine/compactor.go, line 124 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Make sure to open a trace at the caller (tracing.EnsureContext).

Done.


pkg/storage/engine/compactor.go, line 130 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

FWIW, I think you should add the required method to the Engine interface and make it a no-op for anything that doesn't support it.
I'm pretty sure we don't want to do DeleteFilesInRange just yet.

Done.


pkg/storage/engine/compactor.go, line 136 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

to manage Key and EndKey after removing them from the proto, add

type compaction struct {
  enginepb.Compaction
  StartKey, EndKey roachpb.Key
}

and populate them in this loop.

Needed this message struct anyway in ReplicatedEvalResult, so added a SuggestedCompaction message to storagebase.


pkg/storage/engine/compactor.go, line 156 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

a compaction

Done.


pkg/storage/engine/compactor.go, line 168 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Infof implies Eventf, so you don't need both.

This is kinda hard to read, perhaps factor it out:

shouldProcess := totalBytes >= thresholdBytes || totalBytes >= int64(float64(capacity.Used)*thresholdBytesFraction) || timeutil.Since(lastProcessed) >= thresholdTimeSinceLastProcess

if !shouldProcess {
  log.Infof(ctx, "skipping compaction with %d suggestions, total=%db, used=%db, last processed=%s",
			len(suggestions), totalBytes, capacity.Used, lastProcessed)
  return false, nil
}

All those infos were just to see what was going on when testing. Removed.


pkg/storage/engine/compactor.go, line 175 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Remove.

Done.


pkg/storage/engine/compactor.go, line 176 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I'd log after having processed a compaction, and include the duration and its index (and make that the only log message emitted by the compactor in the common case):

tBegin = timeutil.Now()
// work
log.Infof(ctx, "processed compaction #%d/%d for %s in %s", i+1, len(suggestions), humanizeutil.IBytes(sc.Bytes), timeutil.Since(tBegin))

Done.


pkg/storage/engine/compactor.go, line 206 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

These can go.

Done.


pkg/storage/engine/compactor.go, line 230 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

This would pick up startKey, endKey roachpb.Key.

Done.


pkg/storage/engine/compactor.go, line 246 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

log.Fatalf(ctx, "existing compaction %+v incompatible with suggested compaction %+v", existing, sc)

but I'm not sure this should be fatal error. You would just unset the Cleared bit. We know that for the table drop use case this wouldn't ordinarily happen, but that doesn't mean it's illegal to re-use the key space.

Now just using the most recent setting for the cleared bit.


pkg/storage/engine/compactor.go, line 248 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Isn't this double-counting? If the same compaction is suggested twice it should free up the same data.

(after finishing the review, I see that I was wrong here, but it shows that this is surprising and needs a comment)

It shouldn't be. These are only meant to be suggested at times when we're committing a change to the underlying storage engine that actually clears bytes, and those bytes are accounted for in the MVCCStats. Added a comment.


pkg/storage/engine/compactor.go, line 253 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I would happily make that a warning.

Done.



@@ -17,6 +17,7 @@
#include <google/protobuf/stubs/stringprintf.h>
#include <mutex>
#include <rocksdb/cache.h>
#include <rocksdb/convenience.h>
Member Author

No longer need this import

@petermattis
Collaborator

I only looked at the C++ bits. Mostly looks good.


Reviewed 23 of 23 files at r1.
Review status: 13 of 45 files reviewed at latest revision, 38 unresolved discussions, some commit checks failed.


c-deps/libroach/db.cc, line 1636 at r4 (raw file):

uint64_t DBGetApproximateSize(DBEngine* db, DBKey start, DBKey end) {
  rocksdb::Range range(rocksdb::Slice(EncodeKey(start)), rocksdb::Slice(EncodeKey(end)));

s/rocksdb::Range/const rocksdb::Range/g

I'm mildly surprised you have to explicitly cast to rocksdb::Slice, but my C++ is rusty so maybe you do.


c-deps/libroach/db.cc, line 1653 at r4 (raw file):

  // settings like bloom filter configurations (which is the biggest
  // reason we currently have to use this function).
  options.bottommost_level_compaction = rocksdb::BottommostLevelCompaction::kForce;

Using bottom-most level compactions for DBCompactRange is going to be expensive. My inclination would be to leave DBCompact() unchanged and have DBCompactRange simply call DB::CompactRange() without any of the complexity of trying to split up the compaction. This is under the expectation that you'll generally be passing in range-size spans to DBCompactRange.


c-deps/libroach/db.cc, line 1665 at r4 (raw file):

  db->rep->GetLiveFilesMetaData(&all_metadata);

  std::string start_key(EncodeKey(start));

s/std::string/const std::string/g


c-deps/libroach/include/libroach.h, line 109 at r4 (raw file):

DBStatus DBSyncWAL(DBEngine* db);

// Uses GetApproximateSizes with include flags indicating only disk space

Let's reword this to not mention GetApproximateSize so the reader doesn't have to know what that method is about. Something like: Returns an estimate of the disk space used to store the specified range of keys.



@spencerkimball
Member Author

Review status: 13 of 45 files reviewed at latest revision, 36 unresolved discussions, some commit checks failed.


c-deps/libroach/db.cc, line 1636 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/rocksdb::Range/const rocksdb::Range/g

I'm mildly surprised you have to explicitly cast to rocksdb::Slice, but my C++ is rusty so maybe you do.

Done. Looks like I didn't need the explicit casts. Removed.


c-deps/libroach/db.cc, line 1653 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Using bottom-most level compactions for DBCompactRange is going to be expensive. My inclination would be to leave DBCompact() unchanged and have DBCompactRange simply call DB::CompactRange() without any of the complexity of trying to split up the compaction. This is under the expectation that you'll generally be passing in range-size spans to DBCompactRange.

As mentioned in offline conversation, the hope is that we send in huge spans after dropping a table – the compactor makes an effort to group together contiguous or near-contiguous key spans from the queued suggested compactions.

Now the question is, do we want to dynamically avoid using BottommostLevelCompaction::kForce in certain cases?


c-deps/libroach/db.cc, line 1665 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/std::string/const std::string/g

Done.


c-deps/libroach/include/libroach.h, line 109 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Let's reword this to not mention GetApproximateSize so the reader doesn't have to know what that method is about. Something like: Returns an estimate of the disk space used to store the specified range of keys.

Done.



@bdarnell
Contributor

Reviewed 12 of 23 files at r1, 11 of 30 files at r3, 21 of 21 files at r4, 2 of 2 files at r5.
Review status: all files reviewed at latest revision, 34 unresolved discussions, some commit checks failed.


c-deps/libroach/db.cc, line 1652 at r4 (raw file):

  // recompacting the bottom layer is necessary to pick up changes to
  // settings like bloom filter configurations (which is the biggest
  // reason we currently have to use this function).

The "biggest reason" part of this comment is no longer true.


pkg/storage/batcheval/cmd_clear_range.go, line 49 at r4 (raw file):

	DefaultDeclareKeys(desc, header, req, spans)
	// We look up the range descriptor key to check whether the span
	// is equal to the entire range for fast stats updating.

Not just for stats updating: If the span is less than the entire range, we must not update the GCThreshold to the current time (and as a consequence, should we also fall back to MVCC-safe DeleteRange, or use ClearRange without any safeguards against later reads?)


pkg/storage/compactor/compactor.go, line 43 at r4 (raw file):

	// cleared when a big table is dropped, so the compactor can
	// determine contiguous stretches and efficient delete sstable files.
	compactionMinInterval = 2 * time.Minute // 2 minutes

This should probably be a cluster setting. I'm concerned that this is a new component that can run amok and having a quick way to more or less disable it would be a good idea.

We should also consider debuggability for any new component like this: there should be metrics and ideally a debug page about its activity. Any monitoring work we don't do now will be debt that we'll eventually have to repay.


pkg/storage/engine/compactor.go, line 52 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I've changed things so that the size-related thresholds activate if any contiguous collection of suggested compaction spans exceeds them, and dropped the time-based metric.

What do you think about the CompactRange RPC suggestion? I think that would be a simpler way of reclaiming space after dropping a table, and I think I'm more comfortable with that than a new "moving piece" like this compaction queue. This queue is potentially more generalizable to things like reclaiming space after MVCC GC, but it's not yet clear to me how exactly that would work and whether we'll be doing it any time soon.



@petermattis
Collaborator

Review status: all files reviewed at latest revision, 36 unresolved discussions, some commit checks failed.


c-deps/libroach/db.cc, line 1677 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

@petermattis? I've changed it to use start and end key.

Using start_key and end_key as @spencerkimball is doing looks correct.


c-deps/libroach/db.cc, line 1636 at r4 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Done. Looks like I didn't need the explicit casts. Removed.

@tschottdorf is adding an almost identical function in one of his PRs. You two should coordinate.


c-deps/libroach/db.cc, line 1653 at r4 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

As mentioned in offline conversation, the hope is that we send in huge spans after dropping a table – the compactor makes an effort to group together contiguous or near-contiguous key spans from the queued suggested compactions.

Now the question is, do we want to dynamically avoid using BottommostLevelCompaction::kForce in certain cases?

My instinct is that DBCompactRange should not specify BottommostLevelCompaction::kForce while DBCompact should. Easy enough to achieve with a bit of code reorganization to add another helper function.


c-deps/libroach/db.cc, line 1690 at r5 (raw file):

    const rocksdb::Slice start_slice(start_key);
    const rocksdb::Slice end_slice(end_key);
    return ToDBStatus(db->rep->CompactRange(options, start_key.length() > 0 ? &start_slice : nullptr,

s/start_key.length() > 0/!start_key.empty()/g

Ditto for end_key.length() > 0.


c-deps/libroach/include/libroach.h, line 109 at r4 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Done.

See my other comment regarding @tschottdorf's addition of a similar function.



@tbg
Member

tbg commented Dec 13, 2017

Review status: all files reviewed at latest revision, 36 unresolved discussions, some commit checks failed.


c-deps/libroach/db.cc, line 1677 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Using start_key and end_key as @spencerkimball is doing looks correct.

@petermattis can you elaborate? Is that because looking at any key span will essentially give you at least one SSTable at the bottom level?


c-deps/libroach/db.cc, line 1636 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

@tschottdorf Is adding an almost identical function in one of his PRs. You two should coordinate.

I just merged mine. Sorry about the rebase, but should be straightforward.


c-deps/libroach/db.cc, line 1653 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

My instinct is that DBCompactRange should not specify BottommostLevelCompaction::kForce while DBCompact should. Easy enough to achieve with a bit of code reorganization to add another helper function.

@petermattis can you elaborate on the strategy here (ideally in comment form so that we can put it into the code)?

What I hope to understand is that if you have the key range corresponding to a bottom-most SSTable (one of the steps in DBCompact) and that SSTable is covered by multiple range-size ClearRanges (potentially higher up in the LSM), will the SSTable still be deleted without BottommostLevelCompaction::kForce?

I guess that's easy enough to test by running a RESTORE, running an offline compaction, and then dropping the table (and checking that the disk space gets completely released), but would be good to have it in writing.



@tbg
Member

tbg commented Dec 13, 2017

I didn't deep-dive into the code, but I agree with @bdarnell that adding a CompactRange RPC is more prudent at this point (the compactor has too much going on). I also left a few comments on handling ClearRanges that don't cover the full range. (I think we can get away with making them not do much and letting the schema changer run the slow delete path before declaring success).


Reviewed 30 of 30 files at r3, 19 of 21 files at r4, 2 of 2 files at r5.
Review status: all files reviewed at latest revision, 37 unresolved discussions, some commit checks failed.


pkg/storage/batcheval/cmd_clear_range.go, line 49 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Not just for stats updating: If the span is less than the entire range, we must not update the GCThreshold to the current time (and as a consequence, should we also fall back to MVCC-safe DeleteRange, or use ClearRange without any safeguards against later reads?)

With either of those, I'm worried that we might end up creating large proposals -- think above the Raft threshold size -- and so we need chunking, etc, which isn't available at this level.

You're worried only about the case in which the table hasn't been split off from its sibling(s), right? I wonder if we can delay the drop until that has happened. But that's additional complexity...

What if we returned an error here that would force the schema changer into the slow path? This would work well if the error occurs on the last range: the schema changer would retry, but all of the ranges are already empty, so many of the DeleteRanges are trivial. Less great if the first range of the table hasn't split off because then DistSender will short-circuit and not actually remove most of the data.

Third stream-of-consciousness-idea: we make "partial" ClearRanges no-ops that instead return a count back in the response (which gets aggregated by DistSender through the Combinable interface). The schema changer looks at the number returned and, if nonzero, runs the slow path.

I like that last idea because it's straightforward and should work well in practice.
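A sketch of that third idea under hypothetical names (no such field exists in this PR): a partial-range no-op ClearRange would report a count back in the response, merged across ranges the way combinable responses are:

// ClearRangeResponse and NumSkipped are hypothetical illustrations.
type ClearRangeResponse struct {
	NumSkipped int64 // keys left in place by partial-range no-op ClearRanges
}

// combine aggregates per-range responses, mirroring how DistSender merges
// combinable responses; the schema changer would run the slow path iff the
// total is nonzero.
func (r *ClearRangeResponse) combine(other ClearRangeResponse) {
	r.NumSkipped += other.NumSkipped
}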


pkg/storage/compactor/compactor.go, line 43 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This should probably be a cluster setting. I'm concerned that this is a new component that can run amok and having a quick way to more or less disable it would be a good idea.

We should also consider debuggability for any new component like this: there should be metrics and ideally a debug page about its activity. Any monitoring work we don't do now will be debt that we'll eventually have to repay.

+1 to metrics.

Also generally +1 to avoiding amok, though we're starting to accumulate the large heap of knobs that we once thought we'd avoid. Perhaps it's time to think about separating out knobs we want users to stumble upon, and other "never really touch" knobs. I'd really like to be able to test a component like this sufficiently to be confident in its ability to perform well under various workloads. We'll hopefully get there in the coming months.


pkg/storage/compactor/compactor.go, line 38 at r5 (raw file):

// indicates the minimum period of time


pkg/storage/compactor/compactor.go, line 46 at r5 (raw file):

(in observance of compactionMinInterval)


pkg/storage/compactor/compactor.go, line 51 at r5 (raw file):

(in observance of compactionMinInterval)


pkg/storage/compactor/compactor.go, line 264 at r5 (raw file):

		// Ask for an estimate of disk space utilized by the engine in the
		// gap between the existing aggregation and the new suggested compaction.
		gapSize := c.eng.GetApproximateSize(

This feels over-engineered and makes me less confident that this will work equally well in the multitude of scenarios out there. Just merge key spans and run compactions on the connected components.


pkg/storage/engine/enginepb/compact.proto, line 28 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I'm still going to send this as a hint. It's pertinent and I have a feeling it will come in useful.

Useful for what? When you need it, you can just introduce it. Until then, it raises questions on how to merge these spans and what setting the flag does. It does nothing and there are no plans for that to change, so better to remove it.


pkg/storage/engine/compactor.go, line 52 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What do you think about the CompactRange RPC suggestion? I think that would be a simpler way of reclaiming space after dropping a table, and I think I'm more comfortable with that than a new "moving piece" like this compaction queue. This queue is potentially more generalizable to things like reclaiming space after MVCC GC, but it's not yet clear to me how exactly that would work and whether we'll be doing it any time soon.

I agree with Ben that it seems more appropriate at this point to issue an RPC.



@spencerkimball
Member Author

I still believe this is the right approach. See below for my response to Ben's suggestion.


Review status: 8 of 23 files reviewed at latest revision, 18 unresolved discussions.


c-deps/libroach/db.cc, line 1677 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

@petermattis can you elaborate? Is that because looking at any key span will essentially give you at least one SSTable at the bottom level?

@petermattis, yes please elaborate...


c-deps/libroach/db.cc, line 1636 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I just merged mine. Sorry about the rebase, but should be straightforward.

Done.


c-deps/libroach/db.cc, line 1652 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The "biggest reason" part of this comment is no longer true.

Done.


c-deps/libroach/db.cc, line 1653 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

@petermattis can you elaborate on the strategy here (ideally in comment form so that we can put it into the code)?

What I hope to understand is that if you have the key range corresponding to a bottom-most SSTable (one of the steps in DBCompact) and that SSTable is covered by multiple range-size ClearRanges (potentially higher up in the LSM), will the SSTable still be deleted without BottommostLevelCompaction::kForce?

I guess that's easy enough to test by running a RESTORE, running an offline compaction, and then dropping the table (and checking that the disk space gets completely released), but would be good to have it in writing.

Yes, I'm also not clear on this. My understanding is that if we don't force the bottommost level compaction, we won't clear up space for old tables after dropping them. Could you clarify?


c-deps/libroach/db.cc, line 1690 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/start_key.length() > 0/!start_key.empty()/g

Ditto for end_key.length() > 0.

Done.


c-deps/libroach/include/libroach.h, line 109 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

See my other comment regarding @tschottdorf's addition of a similar function.

Done.


pkg/storage/batcheval/cmd_clear_range.go, line 49 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

With either of those, I'm worried that we might end up creating large proposals -- think above the Raft threshold size -- and so we need chunking, etc, which isn't available at this level.

You're worried only about the case in which the table hasn't been split off from its sibling(s), right? I wonder if we can delay the drop until that has happened. But that's additional complexity...

What if we returned an error here that would force the schema changer into the slow path? This would work well if the error occurs on the last range: the schema changer would retry, but all of the ranges are already empty, so many of the DeleteRanges are trivial. Less great if the first range of the table hasn't split off because then DistSender will short-circuit and not actually remove most of the data.

Third stream-of-consciousness-idea: we make "partial" ClearRanges no-ops that instead return a count back in the response (which gets aggregated by DistSender through the Combinable interface). The schema changer looks at the number returned and, if nonzero, runs the slow path.

I like that last idea because it's straightforward and should work well in practice.

I think it's best to simply not update the GCThreshold if we run into the partial range case. This is supposed to be safe, according to the contract which this API call explicitly requires: that the key span being cleared is not to be read or written subsequent to the call. The update of the GCThreshold is just an attempt to get back errors which we can track down if someone down the road changes something in SQL land that breaks the contract.


pkg/storage/compactor/compactor.go, line 43 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

+1 to metrics.

Also generally +1 to avoiding amok, though we're starting to accumulate the large heap of knobs that we once thought we'd avoid. Perhaps it's time to think about separating out knobs we want users to stumble upon, and other "never really touch" knobs. I'd really like to be able to test a component like this sufficiently to be confident in its ability to perform well under various workloads. We'll hopefully get there in the coming months.

I added an environment variable so we can disable it. I've also added metrics, though I think a debug page is too much for this PR.


pkg/storage/compactor/compactor.go, line 38 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

// indicates the minimum period of time

Done.


pkg/storage/compactor/compactor.go, line 46 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

(in observance of compactionMinInterval)

Done.


pkg/storage/compactor/compactor.go, line 51 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

(in observance of compactionMinInterval)

Done.


pkg/storage/compactor/compactor.go, line 264 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

This feels over-engineered and makes me less confident that this will work equally well in the multitude of scenarios out there. Just merge key spans and run compactions on the connected components.

But we don't know for sure things are connected because they often don't overlap. For example, dropping a large table across lots of nodes will leave each store with few directly contiguous cleared ranges – we really need to take a look at what's between the gaps. This seems reasonable to me; see the sketch below. Would you like to suggest a smaller threshold than 64M? I chose that because that's the amount we feel confident writing to disk and moving around on the network. It seems reasonable that we'd be willing to rewrite that amount in order to compact and reclaim 128M.
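For concreteness, a sketch of the gap heuristic described here; the sizer interface, the names, and the 64 MiB constant are illustrative stand-ins for the PR's code:

const maxGapBytes = 64 << 20 // illustrative; rewrite up to this much to join spans

type sizer interface {
	// GetApproximateSize estimates on-disk bytes in [start, end).
	GetApproximateSize(start, end []byte) int64
}

type aggregatedCompaction struct {
	startKey, endKey []byte
	bytes            int64
}

// maybeMerge extends agg to also cover [nextStart, nextEnd) when the data in
// the gap between them is small enough that one contiguous compaction beats
// two separate ones.
func maybeMerge(eng sizer, agg *aggregatedCompaction, nextStart, nextEnd []byte, nextBytes int64) bool {
	if eng.GetApproximateSize(agg.endKey, nextStart) > maxGapBytes {
		return false // too much unrelated data in between
	}
	agg.endKey = nextEnd
	agg.bytes += nextBytes
	return true
}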


pkg/storage/engine/enginepb/compact.proto, line 28 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Useful for what? When you need it, you can just introduce it. Until then, it raises questions on how to merge these spans and what setting the flag does. It does nothing and there are no plans for that to change, so better to remove it.

OK


pkg/storage/engine/compactor.go, line 52 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I agree with Ben that it seems more appropriate at this point to issue an RPC.

A CompactRange RPC has drawbacks which made me choose this approach instead.

  • The RPC needs to be sent to all nodes in the cluster. If you'd prefer to avoid sending to all nodes, you'd need to read through the range metadata for all impacted nodes. What happens if a node is down?
  • Failing that, you'd need to send the CompactRange RPC as a ranged command via the dist sender and encounter potentially significant redundancy. Each store would need some debouncing mechanism to account for that. Two ideas for this mechanism: 1) send a Put command to all replicas, and then process from a compaction queue (similar to what this does already), or 2) introduce a command with a replicated eval result so that all stores which have replicas of the range are notified and can run the compaction on receipt. The second approach would have no store-wide coordination available to rate limit or otherwise provide flow control – by contrast, the compactor serializes all compactions for each store.
  • The CompactRange RPC approach doesn't work when changing the zone config for a table or database, causing ranges to relocate from a node or nodes, because there's no clear way to know when to send the CompactRange RPC. The compactor approach handles this case.
  • The CompactRange RPC approach fails to handle the case of significant data culling through the use of SQL commands like DELETE FROM <table> WHERE X>10. The compactor will receive suggestions after the GC queue sends non-trivial GCRequests.

The compactor has complexity (i.e. another goroutine, and merging of contiguous suggestions), but I think the CompactRange RPC alternative(s) have both complexity (i.e. some combination of: new RPC, new Raft Command, pseudo-queue, debouncing, all-nodes-commands, etc.) and less flexibility to handle space reclamation in scenarios which will clearly impact our users (e.g. migration, manual or automatic large-scale row deletion from SQL).



@petermattis
Collaborator

Review status: 7 of 22 files reviewed at latest revision, 17 unresolved discussions, some commit checks failed.


c-deps/libroach/db.cc, line 1677 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

@petermattis, yes please elaborate...

We specify level_compaction_dynamic_level_bytes which causes RocksDB to add sstables to the bottom-most level very quickly. If we don't have any sstables in the bottom-most level, running DB::CompactRange on the input range seems fine as there is likely very little data in the system.

Does that answer your questions? Perhaps I'm not understanding the concern here.


c-deps/libroach/db.cc, line 1653 at r4 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Yes, I'm also not clear on this. My understanding is that if we don't force the bottommost range, we won't clear up space for old tables after dropping them. Could you clarify?

Re-reading the RocksDB doc comments, I'm actually not sure what will happen if we don't force a bottom-most compaction. If you're concerned about freeing the space, then forcing a bottom-most compaction seems safest. Might be worth experimenting and then documenting whether it is necessary or not.


c-deps/libroach/db.cc, line 1690 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Done.

Did you forget to push this change?



@tbg
Member

tbg commented Dec 15, 2017

I added another comment to that discussion. TL;DR is that I still think the RPC is the right approach (at least for our upcoming release). The queue might win out in the long run, but there's not a strong case for introducing it now as far as I can tell.


Review status: 7 of 22 files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.


c-deps/libroach/db.cc, line 1677 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

We specify level_compaction_dynamic_level_bytes which causes RocksDB to add sstables to the bottom-most level very quickly. If we don't have any sstables in the bottom-most level, running DB::CompactRange on the input range seems fine as there is likely very little data in the system.

Does that answer your questions? Perhaps I'm not understanding the concern here.

I see why I'm confused: I thought the CompactRange below would still run a full compaction, but it really only runs it on the specified key range in that case, so no problem.


c-deps/libroach/db.cc, line 1694 at r6 (raw file):

  if (max_level != db->rep->NumberLevels() - 1) {
    // There are no sstables at the lowest level, so just compact the
    // entire database. Due to the level_compaction_dynamic_level_bytes

s/entire database/specified keyspan wholesale/


c-deps/libroach/db.cc, line 1695 at r6 (raw file):

// ..., this will only happen on key spans containing little data.


pkg/storage/batcheval/cmd_clear_range.go, line 49 at r4 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I think it's best to simply not update the GCThreshold if we run into the partial range case. This is supposed to be safe, according to the contract which this API call explicitly requires: that the key span being cleared is not to be read or written subsequent to the call. The update of the GCThreshold is just an attempt to get back errors which we can track down if someone down the road changes something in SQL land that breaks the contract.

I'd much prefer that this call weren't the first to have a new set of assumptions that need to hold for it to be safe. It seems that we can avoid it (see option three), so I think we should (and in turn remove the assumptions from the comments).


pkg/storage/engine/compactor.go, line 52 at r2 (raw file):

There are some good points there, but I'm skeptical. Consider the following instead:

  • remove the queue (for now).
  • put a map[StoreID]struct{} into ClearRangeResponse which is populated with the stores that hold a replica of the respective range (and aggregated via the combinable interface).
  • then, after the fast-path delete, you can issue a new SuggestCompaction RPC to the respective nodes for the respective stores (for the full key span). At the current time, SuggestCompaction will just run the compaction right there.
  • if a node is down, tough luck. I think it's reasonable not to handle this optimally right now. That specific RPC will fail, but the schema change will still announce success.

I agree that there is some general benefit to the compactor queue in the long run (and when it becomes truly useful, we can change what SuggestCompaction does transparently and potentially also suggest from ClearRange), but I still think it has too many moving parts for now that can cause trouble down the road. It's also not clear that other benefits, such as triggering compactions from the GC queue, are going to make sense to add ultimately (though I like the idea).
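
To make the ClearRangeResponse bullet above concrete, here's a hypothetical Go sketch of the aggregation in the style of the existing combinable implementations (the Stores field doesn't exist today, a real proto map would need a non-struct{} value type, and all names are illustrative):

// combine folds the stores seen by another ClearRangeResponse into this
// one, so the combined response names every store holding an affected replica.
func (r *ClearRangeResponse) combine(c combinable) error {
	otherR, ok := c.(*ClearRangeResponse)
	if !ok {
		return errors.Errorf("unable to combine %T with ClearRangeResponse", c)
	}
	if r.Stores == nil {
		r.Stores = make(map[StoreID]struct{})
	}
	for storeID := range otherR.Stores {
		r.Stores[storeID] = struct{}{}
	}
	return nil
}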

PS: I didn't understand what you mean with

The CompactRange RPC approach doesn't work when changing the zone config for a table or database, causing ranges to relocate from a node or nodes, because there's no clear way to know when to send the CompactRange RPC. The compactor approach handles this case.



@spencerkimball
Member Author

Thank you for the review. I remain convinced this is the right approach...especially given that you already agree this will eventually be necessary to clean up after aggressive GC activity. I will continue making my case:


Your suggested approach works for the case of DROP | TRUNCATE TABLE, but it doesn't provide us with a path to a viable solution for reclaiming space when data is migrated off of a store, or when GC frees up a significant number of previous versions or a span of tombstoned keys caused by conventional use of DELETE SQL statements. My intention is to handle these two cases in follow-up PRs. They're small additional adjustments to suggest new compaction spans, but it seems prudent they be added serially, each with appropriate testing, with the GC queue suggestions coming last.

The overall goal here is straightforward: CockroachDB should reclaim disk space within reasonable operator expectations for timing, in situations where they'll be expecting an RDBMS to do so. So far, we've failed on this score and it's left egg on our faces. It's of course not reasonable or possible to do the most efficient implementations right off the bat – and it's certainly disappointing that RocksDB doesn't do a better job of scheduling these itself.

However, this PR is the foundation for a comprehensive solution to this problem. I don't believe it makes sense to introduce a solution which targets only 1/3 of the identified problem areas, and wait for the remainder to bite us before we must fix them for v2.1 (e.g. handle DROP TABLE, but ignore rebalancing and large-scale GC). That's another six months where we have to explain to people that they need to wait for RocksDB to background compact.

Here are the user stories, to be explicit about it:

"I just followed the demo on the website for migration between clouds, but now both my nodes in GCP and AWS are using significant disk space – in fact, the original GCP nodes, which should be mostly empty, are using more disk space than the AWS nodes. If I migrate a table off of GCP, why doesn't it free up the disk space? The admin UI shows the data has moved but when I du -s the storage directories on my GCP nodes, the disk space just continues to grow slowly."

"I set my TTL for table TransactionLeg to 3600s and my application deleted all entries using a loop running DELETE FROM TransactionLeg LIMIT 10000. It's been 10 hours now and the disk space has only increased, even though nothing else is being done on the cluster. It should have freed up 2G of logical bytes, and the admin UI says that it did, but the disk space seems to only go up."


When we have a solution that handles all three cases, why choose to fix just one case? It sounds like we're worried about complexity. Let me address that here.

  • This is a wholly contained module pkg/storage/compactor/compactor.go containing 385 lines of code. Nothing depends on it.
  • It uses one goroutine per store, a "moving part". That moves the rough count of moving parts from 27 to 28 (Txn sender stats loop, SQL lease manager, Gossip server, Gossip processor, Gossip bootstrap, Gossip clients, Raft scheduler, Raft scheduler workers (x64), Raft tick loop, Coalesced heartbeat loop, Store gossip update listener, Replica scanner loop, Replica queues (x8), Queue purgatory loops (x8), Node periodic metrics, Server settings refresh loop, Server env sampling loop, Time series DB, Node summaries reporting, DistSQL planner runners (x16), Schema change manager, SQL executor, DistSQL flow scheduler, Node liveness heartbeat, SQL jobs registry (x2), Server update checker, RPC heartbeat)
  • This module interacts only with RocksDB, and only to instruct it to run CompactRange.

To gauge the danger of this approach (what if the compactor runs wild?), I've done a small experiment to ascertain the performance hit on the system while asking RocksDB to compact full-range key spans, one at a time, across a 200MiB+ database, all ranges with less than 64MiB of logical bytes, all forcing compaction to the bottommost level. Note the impact on the KV benchmark with 95% reads (the arrow below shows where the compact-range loop starts, one minute into the process):

go run main.go --batch=100 --min-block-bytes=10000 --max-block-bytes=20000 --read-percent=95 --max-rate=100

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     41s        0        10000.0         9999.7      6.0      7.1     10.0     11.0
     42s        0         9999.5         9999.8      6.0      7.3     11.0     11.5
     43s        0        10000.5         9999.8      6.0      8.1     10.5     13.6
     44s        0         9986.7         9999.5      6.0      7.3     10.5     11.5
     45s        0        10013.2         9999.8      6.0      7.6     10.0     10.0
     46s        0        10000.7         9999.8      6.0      7.3      8.9     11.0
     47s        0         9999.3         9999.8      5.8      7.1      8.9     10.5
     48s        0         9999.6         9999.8      6.0      7.9      9.4     10.0
     49s        0        10000.8         9999.8      6.0      7.6      9.4     12.1
     50s        0         9991.3         9999.6      5.8      8.1     12.1     12.6
     51s        0        10008.8         9999.8      5.8      7.3      9.4     11.0
     52s        0        10000.0         9999.8      6.0      8.4     10.5     10.5
     53s        0        10001.2         9999.8      6.0      7.6      8.4     10.0
     54s        0         9980.7         9999.5      6.0      7.6     10.5     12.1
     55s        0        10019.2         9999.8      6.0      8.4     12.1     13.1
     56s        0         9999.8         9999.8      5.8      8.9     12.1     12.1
---> 57s        0         9999.4         9999.8      5.8      8.9     17.8     19.9
     58s        0        10000.2         9999.8      6.0      8.4     15.2     22.0
     59s        0        10000.4         9999.9      6.0      8.1     10.5     11.0
    1m0s        0         9998.6         9999.8      6.3     10.0     10.5     12.1
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
    1m1s        0         9998.7         9999.8      5.8      7.6      8.9      8.9
    1m2s        0        10001.1         9999.8      6.0      8.9     12.1     17.8
    1m3s        0        10000.8         9999.8      5.8      8.1      9.4     11.0
    1m4s        0        10000.0         9999.8      5.8     10.0     12.6     12.6
    1m5s        0         9990.5         9999.7      5.8      8.4      9.4      9.4
    1m6s        0        10009.5         9999.9      6.0      8.1     11.0     12.1
    1m7s        0        10000.9         9999.9      5.8      7.9     10.5     11.5
    1m8s        0        10000.0         9999.9      6.0      8.4     12.6     19.9
    1m9s        0        10000.0         9999.9      5.8     10.5     12.1     12.1
   1m10s        0         9983.3         9999.6      6.0      7.3      9.4     11.5
   1m11s        0        10021.4         9999.9      6.3     10.0     12.6     15.7
   1m12s        0         9995.4         9999.9      5.8      7.9     10.0     11.0
   1m13s        0         9998.5         9999.9      6.3     11.5     16.8     56.6
   1m14s        0        10001.5         9999.9      5.8      7.9     10.5     10.5

And for 100% writes, the performance hit is more pronounced:

go run main.go --batch=100 --min-block-bytes=10000 --max-block-bytes=20000 --read-percent=0 --max-rate=100
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
     41s        0        10004.1         9809.6      8.9     11.5     17.8     21.0
     42s        0         9999.7         9814.1      8.9     11.5     14.7     15.2
     43s        0         9997.7         9818.4      9.4     12.6     17.8     17.8
     44s        0         9996.4         9822.5      8.9     11.0     14.2     14.7
     45s        0        10004.6         9826.5      8.9     12.1     13.6     16.3
     46s        0         9995.4         9830.2      8.9     11.0     13.6     17.8
     47s        0         9999.9         9833.8      8.9     11.0     14.7     15.2
     48s        0        10003.4         9837.3      8.9     15.2     23.1     24.1
     49s        0         9899.5         9838.6      8.9     17.8     28.3     29.4
     50s        0        10100.0         9843.8     12.1     17.8     21.0     46.1
     51s        0        10001.2         9846.9      9.4     14.2     22.0     24.1
     52s        0        10002.0         9849.9      8.9     13.1     17.8     17.8
     53s        0        10001.7         9852.8     10.0     16.3     22.0     23.1
     54s        0         9999.8         9855.5      9.4     15.7     35.7     35.7
     55s        0         9998.4         9858.1     10.0     19.9     23.1     24.1
---> 56s        0        10001.3         9860.6      9.4     41.9     60.8    134.2
     57s        0         9998.3         9863.0      9.4     18.9     22.0     26.2
     58s        0        10001.7         9865.4     10.0     41.9     50.3     52.4
     59s        0        10000.9         9867.7     10.5     16.8     26.2     26.2
    1m0s        0         9897.8         9868.2     10.0     22.0     30.4     37.7
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
    1m1s        0        10101.8         9872.1     10.5     18.9     33.6     39.8
    1m2s        0         9992.1         9874.0      8.9     11.0     19.9     19.9
    1m3s        0        10003.7         9876.1     10.0     16.3     22.0     25.2
    1m4s        0         9997.2         9878.0      8.9     13.1     17.8     17.8
    1m5s        0         9998.7         9879.8     10.5     17.8     23.1     26.2
    1m6s        0        10001.0         9881.6      9.4     16.8     18.9     24.1
    1m7s        0        10001.1         9883.4     12.6     28.3     33.6     39.8
    1m8s        0        10006.5         9885.2      8.9     11.0     14.2     15.2

The good news is that RocksDB apparently doesn't naively rewrite SSTables on CompactRange: re-running the same compactions seems to have little impact. More, larger-scale tests are in order, especially before automatically compacting after significant GC activity. I'm going to get to the 6G table drop soon.

The bottom line is that these three situations where space reclamation should occur can be handled by this mechanism in CockroachDB v2.0, and I don't think we should settle for a solution which handles only one of the three.


Review status: 7 of 22 files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.


c-deps/libroach/db.cc, line 1653 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Re-reading the RocksDB doc comments, I'm actually not sure what will happen if we don't force a bottom-most compaction. If you're concerned about freeing the space, then forcing a bottom-most compaction seems safest. Might be worth experimenting and then documenting whether it is necessary or not.

Will experiment with this.


c-deps/libroach/db.cc, line 1690 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Did you forget to push this change?

Didn't change it...I only changed the previous one. Fixed.


c-deps/libroach/db.cc, line 1694 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

s/entire database/specified keyspan wholesale/

Done.


c-deps/libroach/db.cc, line 1695 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

// ..., this will only happen on key spans containing little data.

Done.


pkg/storage/batcheval/cmd_clear_range.go, line 49 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I'd much prefer that this call weren't the first to have a new set of assumptions that need to hold for it to be safe. It seems that we can avoid it (see option three), so I think we should (and in turn remove the assumptions from the comments).

Option three doesn't allow us to remove the assumptions from this method's api.proto comments. You absolutely cannot read or write this span concurrently with ClearRange, or read it afterward. It requires knowledge by the caller that that won't happen, and that the results, if it does, will be undefined. ClearRange is an optimization which clears away many of the safety assumptions the rest of the API provides.

Clearing a span which is just a subset of a range is exactly as (un)safe as clearing the whole range. Why should we fall back to the slow path?

If it makes things more palatable for you, I could just remove setting of the GC threshold entirely. It's a fig leaf.


pkg/storage/engine/compactor.go, line 52 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

There are some good points there, but I'm skeptical. Consider the following instead:

  • remove the queue (for now).
  • put a map[StoreID]struct{} into ClearRangeResponse which is populated with the stores that hold a replica of the respective range (and aggregated via the combinable interface).
  • then, after the fast-path delete, you can issue a new SuggestCompaction RPC to the respective nodes for the respective stores (for the full key span). At the current time, SuggestCompaction will just run the compaction right there.
  • if a node is down, tough luck. I think it's reasonable not to handle this optimally right now. That specific RPC will fail, but the schema change will still announce success.

I agree that there is some general benefit to the compactor queue in the long run (and when it becomes truly useful, we can change what SuggestCompaction does transparently and potentially also suggest from ClearRange), but I still think it has too many moving parts for now that can cause trouble down the road. It's also not clear that other benefits, such as triggering compactions from the GC queue, are going to make sense to add ultimately (though I like the idea).

PS: I didn't understand what you mean with

The CompactRange RPC approach doesn't work when changing the zone config for a table or database, causing ranges to relocate from a node or nodes, because there's no clear way to know when to send the CompactRange RPC. The compactor approach handles this case.

Hopefully my comment at the top addresses your suggestion. I think what you've proposed here is elegant and at least slightly simpler than what I'm doing. But it doesn't set us up to solve two other major problems with our current behavior which I'm resolved to address for v2.0. I would like to put the larger problem here out of its misery. I further expect that the three areas I've identified will not be the only three.



@bdarnell
Contributor

My main concern is that this is over-engineered for the DROP TABLE case, so if we're moving ahead with broader use of this queue, I'm OK with it. But what's being dropped to make room for all of this in 2.0?


Reviewed 14 of 24 files at r6, 2 of 2 files at r7.
Review status: all files reviewed at latest revision, 12 unresolved discussions, some commit checks failed.


pkg/storage/batcheval/cmd_clear_range.go, line 49 at r4 (raw file):

You're worried only about the case in which the table hasn't been split off from its sibling(s), right? I wonder if we can delay the drop until that has happened. But that's additional complexity...

In addition to the desire to use this on sub-range parts of tables, I think we might want to stop auto-splitting all tables (until we can merge empty ranges).

If it makes things more palatable for you, I could just remove setting of the GC threshold entirely. It's a fig leaf.

Yeah, I'd rather stop setting this. If it's safe to not set it for partial range operations, it's safe to not set it for whole range operations.


pkg/storage/compactor/compactor.go, line 270 at r7 (raw file):

		shouldDelete = true
	} else {
		// If the newest suggested compaction in the aggregation is

Basing this on the newest rather than the oldest could result in an endlessly-growing set of suggestions (which is one kind of "running amok", not just endless compactions).


pkg/storage/compactor/compactor.go, line 287 at r7 (raw file):

	clearStart := engine.MVCCKey{Key: keys.StoreSuggestedCompactionKey(aggr.StartKey, nil)}
	clearEnd := engine.MVCCKey{Key: keys.StoreSuggestedCompactionKey(aggr.EndKey, nil)}
	if err := delBatch.ClearRange(clearStart, clearEnd); err != nil {

Use ClearIterRange instead of ClearRange. The range tombstones left by ClearRange have a cost. (Feel free to rename these methods to indicate that ClearIterRange is the preferred default)


pkg/storage/compactor/compactor.go, line 369 at r7 (raw file):

	// storage engine. All such actions really do result in successively
	// more bytes being made available for compaction, so there is no
	// double-counting if the same range were cleared twice.

There's no double-counting, but it's racy. We may end up over-counting the bytes available to reclaim in a span we just compacted. Something else to watch out for.


pkg/storage/compactor/compactor_test.go, line 63 at r7 (raw file):

}

//

This test looks unfinished.



@tbg
Member

tbg commented Dec 17, 2017

@spencerkimball thanks for running the experiments. Curious how bigger tables turn out.

re: the user stories: I don't dispute that in the long run we probably want to do more than just run a compaction when a table is dropped. I can see fairly directly how this would work after replica GC (the cloud migration case), and the queue seems like a good fit for that. It's much less straightforward to make this work for compactions triggered by the GC queue. For the drop table case, though, your solution seems strictly worse than the RPC. For example, if a node holds two replicas with a large swath of key space between them (key space for which it used to hold replicas, i.e. there's lots of stale data), you won't compact that span in one go. That suggests that even with the compactor, we may still want to introduce the RPC and issue it after the DROP to make sure it gets the "right" suggestion. This seems less of an issue once you have suggestions on replica removal, but even then it's a roundabout way to try to reassemble perfect information held initially.

That said, since you're hellbent on this anyway and I can't spend more cycles pushing for more pragmatism, here's my suggestion wishlist:

  1. we keep the current compactor (the item that convinces me is how straightforward it should be to hook into it from replicaGCQueue)
  2. we remove GetApproximateSizes from the PR (replacing it by the right logic that looks at SSTables directly to check for adjacency)
  3. we introduce the RPC that suggests a compaction to the compactor (that RPC is our escape hatch for users that just need a compaction that won't happen on its own in 2.0). Then, if necessary, we can introduce the compaction call after the schema changer, even in a patch release, without too much trouble. This item can happen after this PR.
  4. It'd be good to be able to introspect the compactor (for programmatic testing, diagnostics, etc). In the spirit of #20713 (cli/user,zone,node: run user, zone, node commands through SQL), that suggests introducing an introspection API call to Compactor (returning a list of compactions) and a SQL vtable crdb_internal.node_compactor which calls into the store's compactors. (This is straightforward to set up.)

I'm also slightly worried about long keys in the compactor. For example, if you have a large table with multi-MB primary keys (and thus potentially multi-MB split keys), and drop that table, you get a lot of compactions (with multi-MB keys) and the compactor will pull them all into memory (I think the right solution here is to avoid long split keys, though).


Review status: all files reviewed at latest revision, 22 unresolved discussions, some commit checks failed.


pkg/storage/batcheval/cmd_clear_range.go, line 49 at r4 (raw file):

Option three doesn't allow us to remove the assumptions from this method's api.proto comments. You absolutely cannot read or write this span concurrently with ClearRange, or read it afterward. It requires knowledge by the caller that that won't happen, and that the results, if it does, will be undefined. ClearRange is an optimization which clears away many of the safety assumptions the rest of the API provides.

From the SQL standpoint, yes. But it retained the KV transactional guarantees (on a per-range basis) by preventing incorrect reads/writes.

I agree with you both though that there seems to be no salvaging that in a meaningful way due to the partial range issue. Let's remove the fig leaf.


pkg/storage/compactor/compactor.go, line 44 at r7 (raw file):

	// compactor can determine contiguous stretches and efficiently delete
	// sstable files.
	defaultCompactionMinInterval = 2 * time.Minute

For the rebalancing case, I'm not sure 2min will cut it. For a large table, when replicas are removed in random order, you'll have to wait a while to see contiguous patches. Just something to watch out for. Perhaps being 100% optimal isn't what we're going for anyway.


pkg/storage/compactor/compactor.go, line 49 at r7 (raw file):

	// reclamation, after which the compactor will begin processing
	// (taking compactor min interval into account).
	defaultThresholdBytes = 128 << 20 // more than 128MiB will trigger

Mention that this is tuned for our L6 target SSTable size.


pkg/storage/compactor/compactor.go, line 54 at r7 (raw file):

	// used which are up for suggested reclamation, after which the
	// compactor will begin processing (taking compactor min interval into account).
	defaultThresholdBytesFraction = 0.10 // more than 10% of space will trigger

I'm not sure this is useful. If it is, could you explain when?


pkg/storage/compactor/compactor.go, line 59 at r7 (raw file):

	// between two suggested compaction spans, before which two consecutive
	// suggested compactions will be merged and compacted together.
	defaultMaxGapBytesThreshold = 64 << 20 // will aggregate if gap <= 64MiB

Explain the rationale behind this number. Isn't 128MB (our L6 SSTable target) a better heuristic?

Also, using GetApproximateSizes looks like the wrong approach. What you really care about is whether the spans together overlap a contiguous set of SSTables; this suggests that you should write some special-cased code that looks at the SSTable metadata, similar to what DBCompactRange does now.

That would also make me feel less concerned about the heuristic in general and would remove this ad-hoc number plus relying on ApproximateSizes.


pkg/storage/compactor/compactor.go, line 64 at r7 (raw file):

	// suggested compaction record. If not processed within this time
	// interval since the compaction was suggested, it will be deleted.
	defaultMaxSuggestedCompactionRecordAge = 24 * time.Hour

This is tuned to the default GC TTL, but with slow enough churn (and GC queue hints) you'd never compact. I know this isn't a goal in this PR, but it is somehow your goal in introducing this mechanism in the first place (admittedly it'll work well for the replicaGCQueue), and so I'm curious to understand how you plan to make this work well with GCQueue suggestions. Not a blocker, but ISTM that a better place to store the suggestions would be in a range-local key that is managed by the GCQueue, where it would send a compaction hint only when it's reasonably sure that a compaction is now warranted (as opposed to putting stuff in this compactor early and letting it endlessly deal with it). But even that is kinda frail, because RocksDB runs its own compactions too and we don't want to duplicate work.


pkg/storage/compactor/compactor.go, line 77 at r7 (raw file):

}

var defaultCompactorOptions = compactorOptions{

I'd make this a method, for nobody should be able to mutate it:

func defaultCompactorOptions() compactorOptions {
  return ...
}
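
For what it's worth, the body would presumably just mirror the default constants; a sketch (ThresholdBytes and ThresholdBytesFraction appear as option fields elsewhere in this review, while the other two field names are my guesses):

func defaultCompactorOptions() compactorOptions {
	return compactorOptions{
		CompactionMinInterval:           defaultCompactionMinInterval,
		ThresholdBytes:                  defaultThresholdBytes,
		ThresholdBytesFraction:          defaultThresholdBytesFraction,
		MaxSuggestedCompactionRecordAge: defaultMaxSuggestedCompactionRecordAge,
	}
}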

pkg/storage/compactor/compactor.go, line 204 at r7 (raw file):

	// becoming too old, at which point they are discarded without
	// compaction.
	delBatch := c.eng.NewWriteOnlyBatch()

do you need a defer delBatch.Close() in case the commit fails?


pkg/storage/compactor/compactor.go, line 218 at r7 (raw file):

		// it if it meets compaction thresholds.
		if done := c.aggregateCompaction(ctx, &aggr, sc); done {
			processedBytes, err := c.processCompaction(ctx, aggr, capacity, delBatch, startIdx, i, suggestCount)

Weird that you need to pass startIdx, i, and suggestCount in here.


pkg/storage/compactor/compactor.go, line 245 at r7 (raw file):

func (c *Compactor) processCompaction(
	ctx context.Context,
	aggr storagebase.SuggestedCompaction,

pass the following instead

type aggregatedCompaction struct {
  storagebase.SuggestedCompaction
  seq int
}

func (ac aggregatedCompaction) String() string {
  return fmt.Sprintf("compaction %d (%s-%s) for %s", ac.seq, ac.StartKey, ac.EndKey, humanizeutil.IBytes(ac.Compaction.Bytes))
}

and remove startIdx, curIdx, length.

I'm not sure we need capacity either, as mentioned elsewhere.


pkg/storage/compactor/compactor.go, line 252 at r7 (raw file):

	shouldDelete := false
	shouldProcess := aggr.Compaction.Bytes >= c.opts.ThresholdBytes ||
		aggr.Compaction.Bytes >= int64(float64(capacity.LogicalBytes)*c.opts.ThresholdBytesFraction)

is that second part ever useful?


pkg/storage/compactor/compactor.go, line 263 at r7 (raw file):

		}
		log.Eventf(ctx, "processed suggested compaction #%d/%d (%s-%s) for %s bytes in %s",
			startIdx, curIdx, length, aggr.StartKey, aggr.EndKey, humanizeutil.IBytes(aggr.Compaction.Bytes), timeutil.Since(startTime))

Message seems off. startIdx seems like it should be removed.

timeutil.Since(startTime) looks like it should go to metrics to measure long-running compactions.

After my suggestion above, this reads something like

duration := timeutil.Since(startTime)
c.Metrics.CompactingNanos.Inc(int64(duration))
log.Eventf(ctx, "processed %s in %s", aggr, duration)

pkg/storage/compactor/compactor.go, line 270 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Basing this on the newest rather than the oldest could result in an endlessly-growing set of suggestions (which is one kind of "running amok", not just endless compactions).

It couldn't be endless because aggregating more and more increases the compaction size and it would hit the threshold.

But it sure seems (at least once the GC queue gives compaction hints) that we could end up with a large amount of suggestions. Say we have 1000 replicas that are far apart and slowly accruing garbage. So every hour or so we receive 1000 compaction hints for small amounts.


pkg/storage/engine/enginepb/compact.proto, line 1 at r7 (raw file):

// Copyright 2017 The Cockroach Authors.

Should move to compactor.



@spencerkimball
Member Author

spencerkimball commented Dec 18, 2017

With scalefactor=1 for tpch, it compacted a dropped 1.3GiB table (tpch.lineitem) in about 23s. The concurrent kv loadgen showed (roughly between the arrows...I don't have full timing):

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   3m41s        0         9999.9         9968.8     10.0     12.1     17.8     17.8
   3m42s        0         9998.5         9968.9     10.0     13.1     15.2     17.8
   3m43s        0        10001.1         9969.0      9.4     11.5     13.1     14.7
   3m44s        0         9993.6         9969.1     10.0     13.1     16.8     17.8
   3m45s        0        10002.8         9969.3     11.0     21.0     25.2     28.3
   3m46s        0         9999.7         9969.4     10.0     15.2     18.9     25.2
   3m47s        0        10003.4         9969.6     10.0     11.5     12.1     13.1
   3m48s        0         9987.4         9969.7     11.0     17.8     22.0     23.1
-->3m49s        0        10012.5         9969.8     10.0     27.3     48.2     92.3
   3m50s        0         9900.7         9969.5     11.5     25.2     44.0     56.6
   3m51s        0        10095.9         9970.1     13.6     24.1     30.4     32.5
   3m52s        0        10004.3         9970.2     10.5     13.6     15.2     19.9
   3m53s        0        10000.4         9970.4     10.0     12.1     13.6     14.2
   3m54s        0         9997.4         9970.5     10.5     15.2     16.8     22.0
   3m55s        0        10102.5         9971.0      9.4     12.6     13.6     15.7
   3m56s        0         9893.9         9970.7      9.4     14.2     16.3     18.9
   3m57s        0        10005.0         9970.9     11.5     21.0     24.1     25.2
   3m58s        0         9900.8         9970.6     12.6     24.1     28.3     30.4
   3m59s        0        10098.2         9971.1     11.0     18.9     26.2     31.5
    4m0s        0        10000.3         9971.2     11.5     28.3     35.7     35.7
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
    4m1s        0        10001.7         9971.4     10.5     15.7     22.0     23.1
    4m2s        0         9999.6         9971.5     11.5     41.9     79.7     83.9
    4m3s        0         9998.3         9971.6     11.0     21.0     25.2     31.5
    4m4s        0        10001.8         9971.7     10.5     15.2     17.8     18.9
    4m5s        0         9999.7         9971.8     13.1     50.3    100.7    142.6
    4m6s        0         9998.8         9971.9     11.5     19.9     23.1     25.2
    4m7s        0         9980.4         9972.0     11.0     18.9     28.3     33.6
    4m8s        0        10002.1         9972.1     12.1     24.1     27.3     29.4
    4m9s        0        10018.0         9972.3     13.6     26.2     48.2     52.4
   4m10s        0         9981.9         9972.3     11.0     16.3     18.9     21.0
   4m11s        0        10099.0         9972.8     11.0     14.2     16.8     18.9
   4m12s        0         9907.9         9972.6     11.0     15.7     22.0     26.2
   4m13s        0        10012.6         9972.7     11.5     25.2     50.3     83.9
-->4m14s        0         9998.4         9972.8     12.6     33.6     75.5     83.9
   4m15s        0         9988.2         9972.9     11.0     18.9     21.0     21.0
   4m16s        0        10013.7         9973.0     12.1     25.2     33.6     46.1
   4m17s        0         9992.6         9973.1     10.5     14.2     16.3     27.3
   4m18s        0         9995.8         9973.2     11.0     16.3     16.8     21.0
   4m19s        0         9911.0         9973.0     11.5     16.3     18.9     21.0
   4m20s        0        10099.4         9973.4      9.4     16.3     19.9    142.6

@spencerkimball
Member Author

spencerkimball commented Dec 18, 2017

I did the scalefactor=5 tpch restore and dropped the table. It compacted 308 ranges for 11GiB of logical bytes in 4m32s. Note that the actual compaction starts at 23:30:44 and ends at 23:35:16.

Also note that there is constant load on the system, same as in the other tests.

[screenshot: admin UI graphs showing the compaction running from 23:30:44 to 23:35:16, captured 2017-12-18 6:38 PM]

@spencerkimball
Member Author

Note that I still haven't done the compactor unittests. Will get to that now that I think the design has stabilized.


Review status: 8 of 19 files reviewed at latest revision, 22 unresolved discussions.


pkg/storage/batcheval/cmd_clear_range.go, line 49 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Option three doesn't allow us to remove the assumptions from this method's api.proto comments. You absolutely cannot read or write this span concurrently with ClearRange, or read it afterward. It requires knowledge by the caller that that won't happen, and that the results, if it does, will be undefined. ClearRange is an optimization which clears away many of the safety assumptions the rest of the API provides.

From the SQL standpoint, yes. But it retained the KV transactional guarantees (on a per-range basis) by preventing incorrect reads/writes.

I agree with you both though that there seems to be no salvaging that in a meaningful way due to the partial range issue. Let's remove the fig leaf.

Done.


pkg/storage/compactor/compactor.go, line 44 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

For the rebalancing case, I'm not sure 2min will cut it. For a large table, when replicas are removed in random order, you'll have to wait a while to see contiguous patches. Just something to watch out for. Perhaps being 100% optimal isn't what we're going for anyway.

Which is fine. It still has to pass the thresholds, which means we'll have to wait for enough contiguous ranges to be compactable. I've changed defaultThresholdBytes to 256MiB to set the bar a bit higher. The 2min interval is really to prevent extremely eager action for the case of DROP|TRUNCATE, although even then it's likely not strictly necessary.


pkg/storage/compactor/compactor.go, line 49 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Mention that this is tuned for our L6 target SSTable size.

I've increased this to 256MiB. I'm a little reluctant to mention the size of the SSTables on disk because the bytes threshold is really being compared to logical bytes.


pkg/storage/compactor/compactor.go, line 54 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I'm not sure this is useful. If it is, could you explain when?

The idea here is to make sure this feature works when folks are dropping tables in a relatively small database. This is how most users are evaluating CockroachDB. If you have 200MiB of data and you drop the whole thing, you want to see your disk space free up, believe me. This is the trigger that ensures that still happens, since the absolute threshold won't trigger: with 200MiB of logical bytes and the 10% fraction, a 200MiB suggestion easily clears the ~20MiB fractional bar despite being far below the absolute bytes threshold. I expanded the comment.


pkg/storage/compactor/compactor.go, line 59 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Explain the rationale behind this number. Isn't 128MB (our L6 SSTable target) a better heuristic?

Also, using GetApproximateSizes looks like the wrong approach. What you really care about is whether the spans together overlap a contiguous set of SSTables; this suggests that you should write some special-cased code that looks at the SSTable metadata, similar to what DBCompactRange does now.

That would also make me feel less concerned about the heuristic in general and would remove this ad-hoc number plus relying on ApproximateSizes.

That's a solid idea. I've removed this defaultMaxGapBytesThreshold. I've added a method to libroach to compute whether two suggested compaction spans overlap contiguous SSTables at the bottommost level.


pkg/storage/compactor/compactor.go, line 64 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

This is tuned to the default GC TTL, but with slow enough churn (and GC queue hints) you'd never compact. I know this isn't a goal in this PR, but it is somehow your goal in introducing this mechanism in the first place (admittedly it'll work well for the replicaGCQueue), and so I'm curious to understand how you plan to make this work well with GCQueue suggestions. Not a blocker, but ISTM that a better place to store the suggestions would be in a range-local key that is managed by the GCQueue, where it would send a compaction hint only when it's reasonably sure that a compaction is now warranted (as opposed to putting stuff in this compactor early and letting it endlessly deal with it). But even that is kinda frail, because RocksDB runs its own compactions too and we don't want to duplicate work.

So far I've really only been trying to make this PR handle the DROP|TRUNCATE and the replicaGCQueue cases. Nevertheless, I don't believe the 1 day window here is a problem for the GCQueue case. The main intent there isn't to handle compaction where lots of deletions are happening here and there in the system – I still intend to place responsibility for that kind of work on the RocksDB background compactor's plate. My goal with this compactor queue is to handle the case where an application or operator deletes data in one fell swoop across many ranges. The canonical case is a repeated loop of DELETE FROM <table> LIMIT 10000. In that event, the GC will happen across the ranges at roughly the same time, certainly within 24 hours.


pkg/storage/compactor/compactor.go, line 77 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I'd make this a method, for nobody should be able to mutate it:

func defaultCompactorOptions() compactorOptions {
  return ...
}

Done.


pkg/storage/compactor/compactor.go, line 204 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

do you need a defer delBatch.Close() in case the commit fails?

We always call commit on the batch, which has the effect of closing it. Am I missing something?


pkg/storage/compactor/compactor.go, line 218 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Weird that you need to pass startIdx, i, and suggestCount in here.

I re-jiggered this a bit based on your suggestions.


pkg/storage/compactor/compactor.go, line 245 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

pass the following instead

type aggregatedCompaction struct {
  storagebase.SuggestedCompaction
  seq int
}

func (ac aggregatedCompaction) String() string {
  return fmt.Sprintf("compaction %d (%s-%s) for %s", ac.seq, ac.StartKey, ac.EndKey, humanizeutil.IBytes(ac.Compaction.Bytes))
}

and remove startIdx, curIdx, length.

I'm not sure we need capacity either, as mentioned elsewhere.

See comment above. We do still need capacity for the fractional threshold.


pkg/storage/compactor/compactor.go, line 252 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

is that second part ever useful?

As I mentioned in an earlier comment, the fractional threshold is about still compacting when people have small databases, but nevertheless expect disk space to free up.


pkg/storage/compactor/compactor.go, line 263 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Message seems off. startIdx seems like it should be removed.

timeutil.Since(startTime) looks like it should go to metrics to measure long-running compactions.

After my suggestion above, this reads something like

duration := timeutil.Since(startTime)
c.Metrics.CompactingNanos.Inc(int64(duration))
log.Eventf(ctx, "processed %s in %s", aggr, duration)

I wanted the log messages to show how things are aggregating – good for debugging if things start acting weird. I made some changes to make it more readable and to get the indexing right.

I've added a metric for time spent compacting.


pkg/storage/compactor/compactor.go, line 270 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

It couldn't be endless because aggregating more and more increases the compaction size and it would hit the threshold.

But it sure seems (at least once the GC queue gives compaction hints) that we could end up with a large amount of suggestions. Say we have 1000 replicas that are far apart and slowly accruing garbage. So every hour or so we receive 1000 compaction hints for small amounts.

As I mentioned in offline comments, my goal with this isn't to send micro hints to the compactor, but only to send suggestions when a range is cleared, rebalanced, or 90% of its data is GC'd.

But to be cautious, I've changed it to always delete records after the max record age.
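
The age check itself is simple; a standalone sketch (names are illustrative, using the SuggestedAtNanos field from this PR's Compaction message):

import "time"

// tooOld reports whether a suggestion has outlived the maximum record
// age and should be deleted without being compacted.
func tooOld(suggestedAtNanos int64, now time.Time, maxAge time.Duration) bool {
	return now.Sub(time.Unix(0, suggestedAtNanos)) >= maxAge
}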


pkg/storage/compactor/compactor.go, line 287 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Use ClearIterRange instead of ClearRange. The range tombstones left by ClearRange have a cost. (Feel free to rename these methods to indicate that ClearIterRange is the preferred default)

Done.


pkg/storage/compactor/compactor.go, line 369 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

There's no double-counting, but it's racy. We may end up over-counting the bytes available to reclaim in a span we just compacted. Something else to watch out for.

Why is it racy? This code path will be protected at a higher level by the command queue and our replica gc machinery. Are you referring to something else?


pkg/storage/compactor/compactor_test.go, line 63 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This test looks unfinished.

I was waiting for buy-in before writing the tests.


pkg/storage/engine/enginepb/compact.proto, line 1 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Should move to compactor.

Done.



@bdarnell
Contributor

Reviewed 15 of 15 files at r8.
Review status: all files reviewed at latest revision, 22 unresolved discussions, some commit checks failed.


pkg/storage/batcheval/cmd_clear_range.go, line 57 at r8 (raw file):

	// Add the GC threshold key, as this is updated as part of clearing a
	// range of data.
	spans.Add(spanset.SpanReadWrite, roachpb.Span{Key: keys.RangeLastGCKey(header.RangeID)})

We can remove the GC threshold key from declareKeys.


pkg/storage/compactor/compactor.go, line 54 at r7 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

The idea here is to make sure this feature works when folks are dropping tables in a relatively small database. This is how most users are evaluating CockroachDB. If you have 200MiB of data and you drop the whole thing, you want to see your disk space free up, believe me. This is the trigger that ensures that still happens, since the absolute threshold won't trigger. I expanded the comment.

For perspective, mysql/innodb in its default configuration did not return unused disk space to the system until version 5.6. There are still parts of the system that are unable to shrink without wiping the disk and starting from scratch. It's great that we're improving our ability to free up space, but remember that failing to do so is not a dealbreaker.


pkg/storage/compactor/compactor.go, line 369 at r7 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Why is it racy? This code path will be protected at a higher level by the command queue and our replica gc machinery. Are you referring to something else?

The race is between the GetProto and PutProto in this method, and the deletions in processCompaction. Suppose there is a 1MB compaction record on disk, and then a new 1MB suggestion arrives for the same range. You could have

  • SuggestCompaction reads 1MB from disk
  • Compactor processes the compaction and clears the existing record
  • SuggestCompaction writes a new record for 1MB+1MB=2MB, even though there is now only 1MB available to reclaim (or maybe less, if the compaction included the deletions that this suggestion was based on).
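
A minimal standalone sketch of one fix, doing the read-modify-write and the record deletion under a single lock (illustrative types, not this PR's):

type suggestions struct {
	mu    sync.Mutex
	bytes map[string]int64 // span -> reclaimable logical bytes
}

// suggest merges a new suggestion; holding mu across the read and the
// write means a concurrent process() can't slip in between them.
func (s *suggestions) suggest(span string, b int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.bytes[span] += b
}

// process consumes a suggestion; clearing it under the same lock means
// suggest() can never re-add bytes that were already compacted away.
func (s *suggestions) process(span string) int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	b := s.bytes[span]
	delete(s.bytes, span)
	return b
}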

pkg/storage/engine/rocksdb.go, line 301 at r8 (raw file):

	overlapsMoreThanTwo := func(tables []SSTableInfo) bool {
		// Search to find the first sstable which might overlap the span.
		i := sort.Search(len(tables), func(i int) bool { return span.Key.Compare(tables[i].End.Key) < 0 })

tables is sorted by start key, not by end key. This would be invalid at L0, where sstables can overlap. We're not running this on L0, but it's probably worth a comment.


Comments from Reviewable

@spencerkimball
Member Author

Review status: 17 of 19 files reviewed at latest revision, 22 unresolved discussions.


pkg/storage/batcheval/cmd_clear_range.go, line 57 at r8 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We can remove the GC threshold key from declareKeys.

Done.


pkg/storage/compactor/compactor.go, line 54 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

For perspective, mysql/innodb in its default configuration did not return unused disk space to the system until version 5.6. There are still parts of the system that are unable to shrink without wiping the disk and starting from scratch. It's great that we're improving our ability to free up space, but remember that failing to do so is not a dealbreaker.

I agree to some extent: MySQL/InnoDB v5.6 was two decades into the product lifecycle. However, you ultimately have to deliver according to customer expectations. We are trying to appeal to a class of customer which would not have taken earlier versions of MySQL very seriously. Not to mention that the goalposts of what an RDBMS should do have moved substantially forward, even since 2013.


pkg/storage/compactor/compactor.go, line 369 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The race is between the GetProto and PutProto in this method, and the deletions in processCompaction. Suppose there is a 1MB compaction record on disk, and then a new 1MB suggestion arrives for the same range. You could have

  • SuggestCompaction reads 1MB from disk
  • Compactor processes the compaction and clears the existing record
  • SuggestCompaction writes a new record for 1MB+1MB=2MB, even though there is now only 1MB available to reclaim (or maybe less, if the compaction included the deletions that this suggestion was based on).

Ah yes, good point. Do you think it's worth fixing this, or is a comment enough?


pkg/storage/engine/rocksdb.go, line 301 at r8 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

tables is sorted by start key, not by end key. This would be invalid at L0, where sstables can overlap. We're not running this on L0, but it's probably worth a comment.

Done.



@tbg
Member

tbg commented Dec 21, 2017

Have to dig into the tests and the SSTable-spy tomorrow, but generally this is starting to look close to final. Good work!


Reviewed 4 of 24 files at r6, 9 of 15 files at r8, 2 of 2 files at r9.
Review status: all files reviewed at latest revision, 25 unresolved discussions, some commit checks failed.


pkg/storage/batcheval/cmd_clear_range.go, line 109 at r9 (raw file):

			Compaction: storagebase.Compaction{
				Bytes:            statsDelta.Total(),
				SuggestedAtNanos: now.WallTime,

Just use the timestamp of the request. Taking it from the clock was necessary for the fig leaf, but that is no more.


pkg/storage/compactor/compactor.go, line 49 at r7 (raw file):
That's a good point. Add to the comment?

// Note that we want to target roughly the target size of an L6 SSTable (128MB)
// but these are logical bytes (as in, from MVCCStats) which can't be translated
// into SSTable-bytes. As a result, we conservatively set a higher threshold.


pkg/storage/compactor/compactor.go, line 54 at r7 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I agree to some extent, MySQL/innodb v5.6 was two decades into the product lifecycle. However, you ultimately have to deliver according to customer expectations. We are trying to appeal to a class of customer which would not have taken earlier versions of MySQL very seriously. Not to mention that the goalposts of what an RDBMS should do have been moved substantially forward, even from 2013.

I'm personally not convinced by this, but perhaps you're right. If you adequately test it, no opposition on my part.


pkg/storage/compactor/compactor.go, line 204 at r7 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

We always call commit on the batch, which has the effect of closing it. Am I missing something?

Probably not, I just don't know what guarantee you have if Commit() fails. Do you still need to close to reliably release resources? Not sure myself.


pkg/storage/compactor/compactor.go, line 220 at r9 (raw file):

	rocksdb, ok := c.eng.(*engine.RocksDB)
	if !ok {
		return false, errors.New("compactor requires engine is a RocksDB instance")

nit: requires that


pkg/storage/compactor/compactor.go, line 248 at r9 (raw file):

			processedBytes, err := c.processCompaction(ctx, aggr, capacity, delBatch)
			if err != nil {
				return false, err

Wouldn't you want to continue compacting "other things"? I think this would work better if aggr started out with an empty SuggestedCompaction.


pkg/storage/compactor/compactor.go, line 287 at r9 (raw file):

	if shouldProcess {
		startTime := timeutil.Now()
		log.Eventf(ctx, "processing suggested compaction %s (%s-%s) for %s",

Pull most of the above into aggr.String(), and then all you have to do is

log.Eventf(ctx, "processing suggested compaction %s", aggr)

Actually, may wanna scratch "suggested".


pkg/storage/compactor/compactor.go, line 299 at r9 (raw file):

		duration := timeutil.Since(startTime)
		c.Metrics.CompactingNanos.Inc(int64(duration))
		log.Eventf(ctx, "processed suggested compaction %s in %s", seqFmt, duration)

Ditto:

log.Eventf(ctx, "processed compaction %s in %s", aggr, duration)

(the trace will already keep track of the duration, btw, but since you have it, might as well print it)


pkg/storage/compactor/compactor.go, line 301 at r9 (raw file):

		log.Eventf(ctx, "processed suggested compaction %s in %s", seqFmt, duration)
	} else {
		log.VEventf(ctx, 2, "skipping suggested compaction(s) %s (%s-%s) for %s",

Ditto


pkg/storage/compactor/compactor.go, line 339 at r9 (raw file):

	// "nearly" contiguous.
	if aggr.EndKey.Less(sc.StartKey) {
		// Aggregate if the combined compaction span overlaps (at most)

I'm confused. Why at most? If you have two key spans that touch on the SST level, wouldn't you always want to merge them, no matter how many SSTables are overlapping?

Why don't you just check whether the "gap" [aggr.EndKey, sc.StartKey] is at most two adjacent SSTables at the lowest level? I thought that's what we wanted.


pkg/storage/compactor/compactor.go, line 342 at r9 (raw file):

		// two contiguous SSTables at the bottommost level.
		span := roachpb.Span{Key: roachpb.Key(aggr.StartKey), EndKey: roachpb.Key(sc.EndKey)}
		maxLevel := ssti.MaxLevelSpanOverlapsContiguousSSTables(span)

I haven't looked at that method yet which probably explains why I'm confused, but I expected MinLevel here.


pkg/storage/compactor/metrics.go, line 36 at r9 (raw file):

	metaBytesQueued = metric.Metadata{
		Name: "compactor.suggestionbytes.queued",
		Help: "Number of bytes in suggested compactions in the queue"}

Could you sprinkle logical bytes throughout the descriptions?


pkg/storage/engine/rocksdb.go, line 265 at r9 (raw file):

// SSTableInfosByLevel maintains slices of SSTableInfo objects, one
// per level, sorted first by level and second by start key. This is

should size be in there somehow or are you just saying that the levels are increasing in the outer slice and then each subslice is sorted by start key?


pkg/storage/engine/rocksdb.go, line 267 at r9 (raw file):

// per level, sorted first by level and second by start key. This is
// different from the default sort order, which is by level, then
// size, then start key..

nit: double dot


pkg/storage/engine/rocksdb.go, line 269 at r9 (raw file):

// size, then start key..
type SSTableInfosByLevel struct {
	// Each level is a slice of SSTableInfos

nit: .


pkg/storage/engine/rocksdb.go, line 297 at r9 (raw file):

// MaxLevelSpanOverlapsContiguousSSTables returns the maximum level at
// which the specified key span overlaps either none, one, or at most
// two contiguous SSTables. Level 0 is returned if no level qualifies.

I have to look at this with fresh eyes tomorrow, but still confused about the "at most two".


pkg/storage/engine/rocksdb_test.go, line 874 at r9 (raw file):

		expMaxLevel int
	}{
		{span: roachpb.Span{Key: roachpb.Key("a"), EndKey: roachpb.Key("z")}, expMaxLevel: 0},

Comments please! 😄
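
e.g. something along these lines for the first case (wording illustrative, assuming the fixture matches the example layout discussed further down):

// "a"-"z" overlaps several sstables at every level, so no level
// qualifies and we expect the level-0 fallback.
{span: roachpb.Span{Key: roachpb.Key("a"), EndKey: roachpb.Key("z")}, expMaxLevel: 0},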



@spencerkimball
Member Author

Review status: 11 of 19 files reviewed at latest revision, 25 unresolved discussions.


pkg/storage/batcheval/cmd_clear_range.go, line 109 at r9 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Just use the timestamp of the request. Taking it from the clock was necessary for the fig leaf, but that is no more.

Done.


pkg/storage/compactor/compactor.go, line 49 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

That's a good point. Add to the comment?

// Note that we want to target roughly the target size of an L6 SSTable (128MB)
// but these are logical bytes (as in, from MVCCStats) which can't be translated
// into SSTable-bytes. As a result, we conservatively set a higher threshold.

Done.


pkg/storage/compactor/compactor.go, line 54 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I'm personally not convinced by this, but perhaps you're right. If you adequately test it, no opposition on my part.

We've had more people query us about why space isn't freeing up on a small database than on large ones. But it's now pretty well tested.


pkg/storage/compactor/compactor.go, line 204 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Probably not, I just don't know what guarantee you have if Commit() fails. Do you still need to close to reliably release resources? Not sure myself.

My review of the code after your original comment suggests that it closes on failure.


pkg/storage/compactor/compactor.go, line 220 at r9 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

nit: requires that

I removed this in favor of a new interface which adds the GetSSTables method.
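
Something like the following (the interface name is my guess; GetSSTables matches the method named above):

// a narrower dependency than *engine.RocksDB, which also keeps the
// compactor testable against a fake engine:
type engineWithSSTables interface {
	engine.Engine
	GetSSTables() engine.SSTableInfos
}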


pkg/storage/compactor/compactor.go, line 248 at r9 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Wouldn't you want to continue compacting "other things"? I think this would work better if aggr started out with an empty SuggestedCompaction.

I'm logging an error now and continuing with whatever other suggested compactions may be waiting.


pkg/storage/compactor/compactor.go, line 287 at r9 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Pull most of the above into aggr.String(), and then all you have to do is

log.Eventf(ctx, "processing suggested compaction %s", aggr)

Actually, may wanna scratch "suggested".

Done.


pkg/storage/compactor/compactor.go, line 299 at r9 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Ditto:

log.Eventf(ctx, "processed compaction %s in %s", aggr, duration)

(the trace will already keep track of the duration, btw, but since you have it, might as well print it)

Done.


pkg/storage/compactor/compactor.go, line 301 at r9 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Ditto

Done.


pkg/storage/compactor/compactor.go, line 339 at r9 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I'm confused. Why at most? If you have two key spans that touch on the SST level, wouldn't you always want to merge them, no matter how many SSTables are overlapping?

Why don't you just check whether the "gap" [aggr.EndKey, sc.StartKey] is at most two adjacent SSTables at the lowest level? I thought that's what we wanted.

Yes, this was an error I caught in my unittests.


pkg/storage/compactor/compactor.go, line 342 at r9 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I haven't looked at that method yet which probably explains why I'm confused, but I expected MinLevel here.

MinLevel? That would be L1...compacting there means you'd likely have to compact your entire database given that we force to the bottommost level. Ouch. By checking against max level, I'm ensuring that the gap isn't larger than two L6 SSTables.


pkg/storage/compactor/metrics.go, line 36 at r9 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Could you sprinkle logical bytes throughout the descriptions?

Done. Good idea.


pkg/storage/engine/rocksdb.go, line 265 at r9 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

should size be in there somehow or are you just saying that the levels are increasing in the outer slice and then each subslice is sorted by start key?

Size is irrelevant here, since we're sorting first by start key and the SSTables are disjoint at L1+. I've tried to clarify the comment.


pkg/storage/engine/rocksdb.go, line 267 at r9 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

nit: double dot

Done.


pkg/storage/engine/rocksdb.go, line 269 at r9 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

nit: .

Done.


pkg/storage/engine/rocksdb.go, line 297 at r9 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I have to look at this with fresh eyes tomorrow, but still confused about the "at most two".

Here's the thinking: Let's say you have only L0, L1, & L2 sstables in a small database for example purposes. And let's say they look as follows:

		// Level 0.
                <omitted as irrelevant>
		// Level 1.
		{Level: 1, Size: 200, Start: key("a"), End: key("j")},
		{Level: 1, Size: 100, Start: key("k"), End: key("o")},
		{Level: 1, Size: 100, Start: key("r"), End: key("t")},
		// Level 2.
		{Level: 2, Size: 201, Start: key("a"), End: key("c")},
		{Level: 2, Size: 200, Start: key("d"), End: key("f")},
		{Level: 2, Size: 300, Start: key("h"), End: key("r")},
		{Level: 2, Size: 405, Start: key("u"), End: key("z")},
  • The span "a"-"c" overlaps only a single SSTable at the max level (L2). That's great, so we definitely want to compact that.
  • The span "s"-"t" overlaps zero SSTables at the max level (L2) . Again, great! That means we're going to compact the 3rd L1 SSTable and maybe push that directly to L2.

pkg/storage/engine/rocksdb_test.go, line 874 at r9 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Comments please! 😄

Writing comments exposed a bug. Thank you.


Comments from Reviewable

@petermattis
Collaborator

Review status: 11 of 19 files reviewed at latest revision, 25 unresolved discussions, some commit checks pending.


pkg/storage/engine/rocksdb.go, line 297 at r9 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Here's the thinking: Let's say you have only L0, L1, & L2 sstables in a small database for example purposes. And let's say they look as follows:

		// Level 0.
                <omitted as irrelevant>
		// Level 1.
		{Level: 1, Size: 200, Start: key("a"), End: key("j")},
		{Level: 1, Size: 100, Start: key("k"), End: key("o")},
		{Level: 1, Size: 100, Start: key("r"), End: key("t")},
		// Level 2.
		{Level: 2, Size: 201, Start: key("a"), End: key("c")},
		{Level: 2, Size: 200, Start: key("d"), End: key("f")},
		{Level: 2, Size: 300, Start: key("h"), End: key("r")},
		{Level: 2, Size: 405, Start: key("u"), End: key("z")},
  • The span "a"-"c" overlaps only a single SSTable at the max level (L2). That's great, so we definitely want to compact that.
  • The span "s"-"t" overlaps zero SSTables at the max level (L2) . Again, great! That means we're going to compact the 3rd L1 SSTable and maybe push that directly to L2.

I haven't been following the developments of this PR, but this caught my eye. We should never have the situation where we have L2 sstables but not L6 sstables. See the level_compaction_dynamic_level_bytes RocksDB setting. I mention this because it sort of fouls up the intuition that levels fill from top to bottom. We very frequently see stores with L0, L1 and L6 sstables, but nothing in L2-L5.
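
In other words, the bottommost populated level has to be discovered by scanning, not assumed from the populated prefix; a sketch:

// With level_compaction_dynamic_level_bytes, L2-L5 are often empty,
// so find the max populated level explicitly.
maxLevel := 0
for _, t := range tables {
    if t.Level > maxLevel {
        maxLevel = t.Level
    }
}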


Comments from Reviewable

@spencerkimball
Member Author

pkg/storage/engine/rocksdb.go, line 297 at r9 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I haven't been following the developments of this PR, but this caught my eye. We should never have the situation where we have L2 sstables but not L6 sstables. See the level_compaction_dynamic_level_bytes RocksDB setting. I mention this because it sort of fouls up the intuition that levels fill from top to bottom. We very frequently see stores with L0, L1 and L6 sstables, but nothing in L2-L5.

That’s good to know, but it doesn’t change the algorithm here. Or do you think it does somehow that I’m missing?


Comments from Reviewable

@petermattis
Collaborator

Review status: 11 of 19 files reviewed at latest revision, 25 unresolved discussions, some commit checks failed.


pkg/storage/engine/rocksdb.go, line 297 at r9 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

That’s good to know, but it doesn’t change the algorithm here. Or do you think it does somehow that I’m missing?

Not sure if you're missing anything, just pointing out that you should be considering gaps in the level structure when thinking through examples.


Comments from Reviewable

@tbg
Member

tbg commented Dec 22, 2017

Only have to look at the tests. Rest LGTM mod (insubstantial) comments.


Reviewed 8 of 8 files at r10.
Review status: all files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.


pkg/storage/compactor/compactor.go, line 342 at r9 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

MinLevel? That would be L1...compacting there means you'd likely have to compact your entire database given that we force to the bottommost level. Ouch. By checking against max level, I'm ensuring that the gap isn't larger than two L6 SSTables.

I was just confused because I wasn't sure what was going on. Resolved now.


pkg/storage/engine/engine.go, line 243 at r10 (raw file):

// EngineWithSSTables extends the Engine interface with a method to get
// info on all SSTables in use.
type EngineWithSSTables interface {

I'd prefer if you just threw this into Engine, even if that means adding a panic("unimplemented") somewhere.


pkg/storage/engine/rocksdb.go, line 297 at r9 (raw file):
Thanks for the explanation, Spencer. That makes sense, but only if you know how we're going to use this, namely for the gap between two existing compactions. Could you tailor the explanation to this use case and also insert your example? Here's something to start you off:

This is useful when considering whether to merge two compactions. In this case, the
method is called with the "gap" between the two spans to be compacted. When the
result is that the gap span touches at most two SSTables at a high level, it
suggests that merging the two compactions is a good idea (as the up to two
SSTables touched by the gap span, due to containing endpoints of the existing
compactions, would be rewritten anyway).

Ideally, the example would also have empty levels, just to be more true to reality with our configuration.
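
In code, the merge decision this describes reduces to roughly the following (aggr, sc, and ssti are assumed names from the compactor; overlapsMoreThanTwo is the method under review):

gap := roachpb.Span{Key: aggr.EndKey, EndKey: sc.StartKey}
if !ssti.overlapsMoreThanTwo(gap) {
    // The up-to-two SSTables touched by the gap contain the endpoints
    // of the existing compactions and would be rewritten anyway, so
    // widen the aggregated compaction to absorb the gap.
    aggr.EndKey = sc.EndKey
    aggr.Bytes += sc.Bytes
}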


pkg/storage/engine/rocksdb.go, line 317 at r10 (raw file):

			(i < len(tables)-2 && span.EndKey.Compare(tables[i+2].Start.Key) <= 0) {
			return false
		}

how about

if i >= len(tables) - 1 {
  // First overlapped SSTable is the last (right-most) SSTable.
  //    Span:   [c-----f)
  //    SSTs: [a---d)
  // or
  //    SSTs: [a-----------q)
  return false
}
if span.EndKey.Compare(tables[i+1].End.Key) <= 0 {
  // Span does not reach outside of this SSTable's right neighbor.
  //    Span:    [c------f)
  //    SSTs: [a---d) [e-f) ...
  return false
}
if i >= len(tables) - 2 {
  // Span reaches outside of this SSTable's right neighbor, but
  // there are no more SSTables to the right.
  //    Span:    [c-------------x)
  //    SSTs: [a---d) [e---q)
  return false
}
if span.EndKey.Compare(tables[i+2].Start.Key) <= 0 {
  // There's another SSTable two to the right, but the span doesn't
  // reach into it.
  //    Span:    [c------------x)
  //    SSTs: [a---d) [e---q) [x--z) ...
  return false
}

// Touching at least three SSTables.
//    Span:    [c-------------y)
//    SSTs: [a---d) [e---q) [x--z) ...

return true

Comments from Reviewable

@tbg
Member

tbg commented Dec 22, 2017

:lgtm:


Reviewed 1 of 15 files at r8.
Review status: all files reviewed at latest revision, 26 unresolved discussions, some commit checks failed.


pkg/storage/compactor/compactor_test.go, line 71 at r10 (raw file):

		{Level: 2, Size: 300, Start: key("h"), End: key("r")},
		{Level: 2, Size: 405, Start: key("s"), End: key("z")},
	}

Should have some empty levels in between for realism. Just move L2 to L6 and insert some more L2 which (I think) would allow you to avoid large changes in the tests?


pkg/storage/compactor/compactor_test.go, line 127 at r10 (raw file):

			suggestions: []storagebase.SuggestedCompaction{
				{
					StartKey: rkey("a"), EndKey: rkey("b"),

Was there a reason to make these RKey? Imo they should be Key (which is what all of the engine-level stuff deals in).


pkg/storage/compactor/compactor_test.go, line 191 at r10 (raw file):

					StartKey: rkey("b"), EndKey: rkey("c"),
					Compaction: storagebase.Compaction{
						Bytes:            defaultThresholdBytes - (defaultThresholdBytes / 2),

Nice touch dealing with the integer division.


pkg/storage/compactor/compactor_test.go, line 305 at r10 (raw file):

		// Double suggestion with too much space in between.
		{
			name: "double suggestion with too much space",

s/space/gap here and above and below?


pkg/storage/compactor/compactor_test.go, line 465 at r10 (raw file):

					t.Errorf("expected compaction time > 0? %t; got %d", e, a)
				}
				// Read the remaining suggestions in the queue; verify compacted

Insert a time.Sleep(testCompactionLatency) into the wrappedEngines compaction method, and verify that the "time spent" metric increases by at least len(test.expCompactions)*testCompactionLatency.Nanoseconds().
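
Roughly (the metric and field names here are assumptions, not the PR's actual identifiers):

const testCompactionLatency = time.Millisecond

// wrappedEngine's compaction method sleeps for testCompactionLatency,
// and the test then asserts a lower bound on the time-spent metric:
minNanos := int64(len(test.expCompactions)) * testCompactionLatency.Nanoseconds()
if a := compactor.Metrics.CompactingNanos.Count(); a < minNanos {
    t.Errorf("expected at least %d compacting nanos; got %d", minNanos, a)
}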


pkg/storage/compactor/compactor_test.go, line 466 at r10 (raw file):

				}
				// Read the remaining suggestions in the queue; verify compacted
				// spans have been cleared and uncompacted spans remain.

Once you're here, shouldn't the rest always succeed? That would suggest moving it out of the SucceedsSoon.


pkg/storage/compactor/compactor_test.go, line 474 at r10 (raw file):

						start, end, err := keys.DecodeStoreSuggestedCompactionKey(kv.Key.Key)
						if err != nil {
							return false, errors.Wrapf(err, "failed to decode suggested compaction key")

t.Fatal? Ditto below.


pkg/storage/compactor/compactor_test.go, line 505 at r10 (raw file):

	// Add a suggested compaction -- this won't get processed by this
	// compactor for two minutes.

Could you just set compactor.opts.CompactionMinInterval = time.Hour to make that explicit?


pkg/storage/compactor/compactor_test.go, line 515 at r10 (raw file):

	// Create a new fast compactor with a short wait time for processing,
	// using the same engine so that it see a non-empty queue.

sees (unless this is intentionally in subjunctive mood).


pkg/storage/compactor/compactor_test.go, line 519 at r10 (raw file):

	fastCompactor := NewCompactor(we, capacityFn)
	fastCompactor.Start(context.Background(), tracing.NewTracer(), stopper)
	defer func() {

defer stopper.Stop(context.Background())


pkg/storage/compactor/compactor_test.go, line 522 at r10 (raw file):

		stopper.Stop(context.Background())
	}()
	fastCompactor.opts.CompactionMinInterval = time.Millisecond

Shouldn't you set this before calling .Start?


pkg/storage/compactor/compactor_test.go, line 571 at r10 (raw file):

			context.Background(), keys.LocalStoreSuggestedCompactionsMin, keys.LocalStoreSuggestedCompactionsMax,
		); err != nil || !empty {
			return fmt.Errorf("compaction queue not empty or err: %t, %s", empty, err)

use the %v verb for err since it may be nil. See https://play.golang.org/p/-UrrT7C1Q_d.
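
A quick illustration of the difference:

var err error // nil
fmt.Printf("%v\n", err) // prints: <nil>
fmt.Printf("%s\n", err) // prints: %!s(<nil>)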


pkg/storage/engine/rocksdb.go, line 301 at r8 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Done.

Not done?


pkg/storage/engine/rocksdb_test.go, line 869 at r10 (raw file):

	}

	// Next, verify various contiguous overlap scenarios.

Have only casually browsed this test. There's a lot of cognitive overhead to verify the results. Unfortunately, I don't have a good idea on how to improve it.


Comments from Reviewable

@spencerkimball
Member Author

Review status: all files reviewed at latest revision, 26 unresolved discussions, some commit checks failed.


pkg/storage/compactor/compactor.go, line 342 at r9 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I was just confused because I wasn't sure what was going on. Resolved now.

Gotcha. Yeah this stuff is confusing until you pore over it long enough. Even then.


pkg/storage/compactor/compactor_test.go, line 71 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Should have some empty levels in between for realism. Just move L2 to L6 and insert some more L2 which (I think) would allow you to avoid large changes in the tests?

Did that and it worked as expected. I moved L1 -> L2 and L2 -> L6.


pkg/storage/compactor/compactor_test.go, line 127 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Was here a reason to make these RKey? Imo they should be Key (which is what all of the engine-level stuff deals in).

No, I believe other things at the same level were and I was cargo-culting. Changed.


pkg/storage/compactor/compactor_test.go, line 191 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Nice touch dealing with the integer division.

:-)


pkg/storage/compactor/compactor_test.go, line 305 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

s/space/gap here and above and below?

Done.


pkg/storage/compactor/compactor_test.go, line 465 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Insert a time.Sleep(testCompactionLatency) into the wrappedEngines compaction method, and verify that the "time spent" metric increases by at least len(test.expCompactions)*testCompactionLatency.Nanoseconds().

Neat idea. Done.


pkg/storage/compactor/compactor_test.go, line 466 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Once you're here, shouldn't the rest always succeed? That would suggest moving it out of the SucceedsSoon.

Not necessarily. All of the metrics and the actual compactions slice can be set before we manage to commit the delBatch. Better to leave it here.


pkg/storage/compactor/compactor_test.go, line 474 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

t.Fatal? Ditto below.

Done. But not sure I understand what the ditto refers to.


pkg/storage/compactor/compactor_test.go, line 505 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Could you just set compactor.opts.CompactionMinInterval = time.Hour to make that explicit?

Done.


pkg/storage/compactor/compactor_test.go, line 515 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

sees (unless this is intentionally in subjunctive mood).

Done.


pkg/storage/compactor/compactor_test.go, line 519 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

defer stopper.Stop(context.Background())

Done.


pkg/storage/compactor/compactor_test.go, line 522 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Shouldn't you set this before calling .Start?

Done.


pkg/storage/compactor/compactor_test.go, line 571 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

use the %v verb for err since it may be nil. See https://play.golang.org/p/-UrrT7C1Q_d.

Done.


pkg/storage/engine/engine.go, line 243 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I'd prefer if you just threw this into Engine, even if that means adding a panic("unimplemented") somewhere.

So far we've avoided adulterating Engine with any implementation-specific concepts. I'm fully aware how tied to RocksDB we are regardless, but this seems like a reasonable approach. Ben suggested this. How strongly do you feel about this point?
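
For reference, a sketch of the optional-interface pattern being defended (using the EngineWithSSTables extension quoted above; eng is a hypothetical engine.Engine):

if ews, ok := eng.(engine.EngineWithSSTables); ok {
    // Engines with SSTable introspection feed the compactor.
    _ = ews.GetSSTables()
}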


pkg/storage/engine/rocksdb.go, line 301 at r8 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Not done?

I added a comment below in the last commit. I've added another before overlapsMoreThanTwo for extra clarity.


pkg/storage/engine/rocksdb.go, line 297 at r9 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Thanks for the explanation, Spencer. That makes sense, but only if you know how we're going to use this, namely for the gap between two existing compactions. Could you tailor the explanation to this use case and also insert your example? Here's something to start you off:

This is useful when considering when to merge two compactions. In this case, the
method is called with the "gap" between the two spans to be compacted. When the
result is that the gap span touches at most two SSTables at a high level, it
suggests that merging the two compactions is a good idea (as the up to two
SSTables touched by the gap span, due to containing endpoints of the existing
compactions, would be rewritten anyway).

Ideally, the example would also have empty levels, just to be more true to reality with our configuration.

Done.


pkg/storage/engine/rocksdb.go, line 317 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

how about

if i >= len(tables) - 1 {
  // First overlapped SSTable is the last (right-most) SSTable.
  //    Span:   [c-----f)
  //    SSTs: [a---d)
  // or
  //    SSTs: [a-----------q)
  return false
}
if span.EndKey.Compare(tables[i+1].End.Key) <= 0 {
  // Span does not reach outside of this SSTable's right neighbor.
  //    Span:    [c------f)
  //    SSTs: [a---d) [e-f) ...
  return false
}
if i >= len(tables) - 2 {
  // Span reaches outside of this SSTable's right neighbor, but
  // there are no more SSTables to the right.
  //    Span:    [c-------------x)
  //    SSTs: [a---d) [e---q)
  return false
}
if span.EndKey.Compare(tables[i+2].Start.Key) <= 0 {
  // There's another SSTable two to the right, but the span doesn't
  // reach into it.
  //    Span:    [c------------x)
  //    SSTs: [a---d) [e---q) [x--z) ...
  return false
}

// Touching at least three SSTables.
//    Span:    [c-------------y)
//    SSTs: [a---d) [e---q) [x--z) ...

return true

Wow...done.


pkg/storage/engine/rocksdb_test.go, line 869 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Have only casually browsed this test. There's a lot of cognitive overhead to verify the results. Unfortunately, I don't have a good idea on how to improve it.

I'm fairly comfortable with it.


Comments from Reviewable

@tbg
Member

tbg commented Dec 23, 2017

:lgtm_strong:


Reviewed 6 of 6 files at r11.
Review status: all files reviewed at latest revision, 13 unresolved discussions.


pkg/storage/compactor/compactor.go, line 393 at r11 (raw file):

	// Check whether a suggested compaction already exists for this key span.
	key := keys.StoreSuggestedCompactionKey(roachpb.RKey(sc.StartKey), roachpb.RKey(sc.EndKey))

Any point in the RKey-ness of StoreSuggestedCompactionKey?


pkg/storage/compactor/compactor_test.go, line 474 at r10 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Done. But not sure I understand what the ditto refers to.

This was in reference to moving everything here out of SucceedsSoon. Disregard!


pkg/storage/engine/engine.go, line 243 at r10 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

So far we've avoided adulterating Engine with any implementation-specific concepts. I'm fully aware how tied to RocksDB we are regardless, but this seems like a reasonable approach. Ben suggested this. How strongly do you feel about this point?

I prefer the explicitness over finding out the hard way via an interface assertion panic in (*Store).Start. Hence my suggestion, but it's a minor point that I don't feel strongly about.


Comments from Reviewable

@spencerkimball
Member Author

Review status: all files reviewed at latest revision, 13 unresolved discussions, some commit checks pending.


pkg/storage/compactor/compactor.go, line 393 at r11 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Any point in the RKey-ness of StoreSuggestedCompactionKey?

I was under the impression that those routines were meant to deal with raw keys specifically. Is that not the case? Or is it moot because we don't expect these kinds of keys to ever have addressing info?


pkg/storage/compactor/compactor_test.go, line 474 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

This was in reference to moving everything here out of SucceedsSoon. Disregard!

Gotcha.


pkg/storage/engine/engine.go, line 243 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I prefer the explicitness over finding out the hard way via an interface assertion panic in (*Store).Start. Hence my suggestion, but it's a minor point that I don't feel strongly about.

I figure that will happen in testing when we introduce a new engine type and we'll add the right bifurcation at that point.


Comments from Reviewable

@tbg
Member

tbg commented Dec 23, 2017

Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.


pkg/storage/compactor/compactor.go, line 393 at r11 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I was under the impression that those routines were meant to deal with raw keys specifically. Is that not the case? Or because we don't expect these kinds of keys to ever have addressing info, it's moot?

Yeah. RKey matters only for keys that need to be routed through DistSender.


Comments from Reviewable

@spencerkimball
Member Author

Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.


pkg/storage/compactor/compactor.go, line 393 at r11 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Yeah. RKey matters only for keys that need to be routed through DistSender.

Done.


Comments from Reviewable

Clear range commands now come with an attendant suggested range compaction
hint. Any suggested compactions generated during command execution are now
sent via replicated result data to each replica and stored in a store-local
queue of pending compaction suggestions.

A new compactor goroutine runs periodically to process pending suggestions.
If more than an absolute number of bytes is reclaimable, or if the bytes
to reclaim exceed a threshold fraction of the total used bytes, we'll go
ahead and compact the suggested range.

Suggested compactions are allowed to remain in the queue for at most
24 hours, after which, if they haven't been aggregated into a compactable
key span, they are discarded and left to RocksDB's background compaction
processing.

Release note (UX improvement): When tables are dropped, the
space will be reclaimed in a more timely fashion.
@spencerkimball spencerkimball merged commit bd7e5fb into cockroachdb:master Dec 24, 2017
@spencerkimball spencerkimball deleted the compaction-queue branch December 24, 2017 14:08