Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sstable: add meta block for range keys; support writing range keys #1400

Merged
merged 1 commit into from
Dec 13, 2021

Conversation

nicktrav
Copy link
Contributor

@nicktrav nicktrav commented Dec 8, 2021

Range keys (see #1341) will be stored in their own, single block of an
sstable. Add a new, optional meta block, indexed as "pebble.range_key"
in the metablock index, to the sstable structure. This block is only
present when at least one range key has been written to the sstable.

Add the ability to add range keys to an sstable via
(*sstable.Writer).Write.

Update existing data-driven tests to support printing of the range key
summary. Add additional test coverage demonstrating writing of range
keys with an sstable.Writer.

Add minimal functionality to sstable.Reader to support writing the
data-driven test cases for the writer. Additional read-oriented
functionality will be added in a subsequent patch.

Related to #1339.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nicktrav nicktrav requested a review from jbowens December 8, 2021 03:39
@nicktrav nicktrav changed the title sstable: add meta block for range keys; support writing range keys WIP: sstable: add meta block for range keys; support writing range keys Dec 8, 2021
@nicktrav nicktrav changed the title WIP: sstable: add meta block for range keys; support writing range keys sstable: add meta block for range keys; support writing range keys Dec 8, 2021
testdata/event_listener Outdated Show resolved Hide resolved
sstable/writer.go Outdated Show resolved Hide resolved
sstable/writer.go Outdated Show resolved Hide resolved
Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a quick pass: looks good! I think the big item remaining is the public interface, but we can also defer the interface to subsequent PRs.

Reviewed 5 of 14 files at r1.
Reviewable status: 5 of 14 files reviewed, 8 unresolved discussions (waiting on @jbowens and @nicktrav)


sstable/table.go, line 162 at r1 (raw file):

	rocksDBFormatVersion2 = 2

	// FIXME(travers): Are we ok with this naming?

sgtm


sstable/writer.go, line 245 at r1 (raw file):

	}
	return w.addPoint(key, value)
}

I alluded to this when we talked last week, but there's a complication to the interface that we export in that CockroachDB needs to be able to construct sstables for ingestion that contain range keys, but we don't want to expose the internals of fragmentation, coalescing or the value-encoding to CockroachDB.

When CockroachDB is constructing a sstable for ingest, it'll read individual fragmented range key sets (containing a single suffix, value pair and bounds). As it reads them, it'll need to pass them to the Writer. However, we don't want to just write those verbatim to the sstable block as internal range keys: we'll want to fragment and coalesce the logical range keys, so that we write coalesced range keys to the block.

On the other hand, during a compaction Pebble will use a merging range key iter to produce coalesced, nonoverlapping CoalescedSpans that need to be written to the sstable as up to 1 key per range-key kind, in-order. This use case could make use of an interface like the one exposed here.

I think we'll want to support two separate interfaces for these (and perhaps prohibit range keys from being added through Add to avoid accidental misuse). Maybe something like

(*Writer) AddInternalRangeKey(key base.InternalKey, value []byte)

and

(*Writer) RangeKeySet(start, end, suffix, value []byte)
(*Writer) RangeKeyUnset(start, end, suffix []byte)
(*Writer) RangeKeyDelete(start, end []byte)

sstable/writer.go, line 396 at r1 (raw file):

		if !ok {
			// We panic here as we should have previously decoded and validated this
			// key and value when it was first added to teh range key block.

nit: s/teh/the/


sstable/writer.go, line 440 at r1 (raw file):

				// NOTE: the inequality excludes zero, as we allow the end key of the
				// lower span be the same as the start key of the upper span, even
				// though the range end key is considered an exclusive bound.

nit: instead of "even though the range end key is considered an exclusive bound," should this say "because the range end key is considered an exclusive bound."? The exclusivity of the end bound is what prevents them from overlapping.


sstable/writer.go, line 464 at r1 (raw file):

	// FIXME(travers): More granular properties for RANGEKEYSET, *UNSET, *DEL
	// counts?

I think more granular properties are a good idea: It might be useful for compaction heuristics.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 14 files at r1.
Reviewable status: 6 of 14 files reviewed, 8 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)


sstable/writer.go, line 464 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I think more granular properties are a good idea: It might be useful for compaction heuristics.

+1

@nicktrav nicktrav force-pushed the nickt.sstable-rangekeys branch from 156a588 to cf53b51 Compare December 8, 2021 20:37
Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defer the interface to subsequent PRs.

That would probably be good for directing focus towards internal implementation / external use. I will update this patch to add (*Writer) AddInternalRangeKey(key base.InternalKey, value []byte), as you suggested below. This would be for "internal" use (compactions, etc.).

Then in a follow up, I'll add the per-key-kind functions that cockroach can use, which will also contain the logic to fragment, etc.

Dismissed @jbowens from a discussion.
Reviewable status: 4 of 14 files reviewed, 2 unresolved discussions (waiting on @jbowens and @sumeerbhola)


sstable/table.go, line 162 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

sgtm

👍


sstable/writer.go, line 245 at r1 (raw file):

I think we'll want to support two separate interfaces for these (and perhaps prohibit range keys from being added through Add to avoid accidental misuse)

Sounds good! It should be straight forward to adapt what I have here to make use of the first, which will have more stricter requirements on ordering, etc.

I will tackle the remainder separately, as they will need to make use of the fragmenter, less restrictions on key order, new test cases, etc.


sstable/writer.go, line 390 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Note to self, this comment is old. Update it.

Done.


sstable/writer.go, line 405 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Note to self, fix the span in the error message.

Done.


sstable/writer.go, line 440 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: instead of "even though the range end key is considered an exclusive bound," should this say "because the range end key is considered an exclusive bound."? The exclusivity of the end bound is what prevents them from overlapping.

Updated.


sstable/writer.go, line 464 at r1 (raw file):

Previously, sumeerbhola wrote…

+1

Great. I'll add counters for the separate key kinds.

@nicktrav nicktrav force-pushed the nickt.sstable-rangekeys branch 2 times, most recently from f2d0bb1 to 428046b Compare December 8, 2021 23:31
Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be good for a proper look now. Have addressed all FIXMEs and exposed the new function on the Writer for Pebble-internal range key addition.

Reviewable status: 3 of 14 files reviewed, 2 unresolved discussions (waiting on @jbowens and @sumeerbhola)


sstable/writer.go, line 464 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Great. I'll add counters for the separate key kinds.

Done.

@nicktrav nicktrav marked this pull request as ready for review December 8, 2021 23:35
@nicktrav nicktrav mentioned this pull request Dec 9, 2021
29 tasks
Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 14 files at r1, 1 of 3 files at r2, 2 of 7 files at r3.
Reviewable status: 7 of 14 files reviewed, 6 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)


sstable/properties.go, line 120 at r3 (raw file):

	NumRangeKeyDels uint64 `prop:"pebble.num.range-key-dels"`
	// The number of range keys in this table.
	NumRangeKeys uint64 `prop:"pebble.num.range-keys"`

I think we should drop the aggregate NumRangeKeys property, and we can compute it by summing the individual kinds if we need it.


sstable/writer.go, line 385 at r3 (raw file):

// AddInternalRangeKey adds a range key set, unset, or delete key/value pair to
// the table being writer.

nit: s/writer/written/


sstable/writer.go, line 388 at r3 (raw file):

//
// Range keys must be added ordered by their start key, and in their fragmented
// span order. Other than spans that are perfectly aligned (same start and end

Wondering if we should adjust the phrasing to indicate that the key parameter, indicating the Start boundary, must be strictly ascending.


sstable/writer.go, line 417 at r3 (raw file):

		}

		switch c := base.InternalCompare(w.compare, prevStartKey, curStartKey); {

I didn't catch this on my first pass, but this InternalCompare comparison should always return -1 with properly ordered range keys. If the user keys are equal, it will compare on sequence numbers, and if those are equal, it will compare on kind. So the c == 0 case is always an error case because the user is adding a duplicate key.

We could replace the switch with:

if base.InternalCompare(w.compare, prevStartKey, curStartKey) >= 0 {
	w.err = errors.Errorf("pebble: range keys starts must be added in ascending order: %s, %s",
		prevStartKey.Pretty(w.formatKey), key.Pretty(w.formatKey))

}

and lift the fragmentation bounds checks from the default: case into the main function block.


sstable/writer.go, line 844 at r3 (raw file):

		}
		w.meta.LargestRangeKey = base.MakeInternalKey(endKey, base.InternalKeySeqNumMax, key.Kind()).Clone()
		rangeKeyBH, err = w.writeBlock(w.rangeKeyBlock.finish(), NoCompression)

@sumeerbhola — do you know why we don't compress the range-del block? I'm wondering if whatever reason also applies here.

Compression is more desirable for range keys, since they carry values but are still fragmented.


sstable/writer.go, line 1048 at r3 (raw file):

		rangeDelBlock: blockWriter{
			restartInterval: 1,
		},

I think we need to initialize the rangeKeyBlock field and its restartInterval to 1. This disables key prefix compression. I think we should also update this blockIter comment to include range keys in its description of key stability because we'll require that stability for range keys too:

// A blockIter provides an additional guarantee around key stability when a
// block has a restart interval of 1 (i.e. when there is no prefix
// compression). Key stability refers to whether the InternalKey.UserKey bytes
// returned by a positioning call will remain stable after a subsequent
// positioning call. The normal case is that a positioning call will invalidate
// any previously returned InternalKey.UserKey. If a block has a restart
// interval of 1 (no prefix compression), blockIter guarantees that
// InternalKey.UserKey will point to the key as stored in the block itself
// which will remain valid until the blockIter is closed. The key stability
// guarantee is used by the range tombstone code which knows that range
// tombstones are always encoded with a restart interval of 1. This per-block
// key stability guarantee is sufficient for range tombstones as they are
// always encoded in a single block.

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 7 of 14 files reviewed, 6 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)


sstable/writer.go, line 844 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

@sumeerbhola — do you know why we don't compress the range-del block? I'm wondering if whatever reason also applies here.

Compression is more desirable for range keys, since they carry values but are still fragmented.

Hmm, I suspect it's because we cache compressed blocks, so we want to avoid the CPU cost of decompressing the range-del block on every read in the sstable. That concern also applies to range keys, because every MVCC read will need to consult the range-key block if there is one...

I think we should keep it uncompressed for now, but I think this deserves more thought.

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just thought of something—We'll want to feed these range keys to the block property collectors, but probably through a new range-key specific method on the BlockPropertyCollector. Some block-level property filters might want to adjust the table-level properties based on the range keys.

I think tackling that in a followup PR is fine too.

Reviewable status: 7 of 14 files reviewed, 6 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)

Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added an item to the list for wiring up the block property collectors.

Reviewable status: 7 of 14 files reviewed, 5 unresolved discussions (waiting on @jbowens and @sumeerbhola)


sstable/properties.go, line 120 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I think we should drop the aggregate NumRangeKeys property, and we can compute it by summing the individual kinds if we need it.

Done.


sstable/writer.go, line 388 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Wondering if we should adjust the phrasing to indicate that the key parameter, indicating the Start boundary, must be strictly ascending.

Done. Matched the verbiage that we use elsewhere.


sstable/writer.go, line 417 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I didn't catch this on my first pass, but this InternalCompare comparison should always return -1 with properly ordered range keys. If the user keys are equal, it will compare on sequence numbers, and if those are equal, it will compare on kind. So the c == 0 case is always an error case because the user is adding a duplicate key.

We could replace the switch with:

if base.InternalCompare(w.compare, prevStartKey, curStartKey) >= 0 {
	w.err = errors.Errorf("pebble: range keys starts must be added in ascending order: %s, %s",
		prevStartKey.Pretty(w.formatKey), key.Pretty(w.formatKey))

}

and lift the fragmentation bounds checks from the default: case into the main function block.

Ah nice catch. That was left over from before making use of base.InternalCompare. Cleaned this up.


sstable/writer.go, line 844 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Hmm, I suspect it's because we cache compressed blocks, so we want to avoid the CPU cost of decompressing the range-del block on every read in the sstable. That concern also applies to range keys, because every MVCC read will need to consult the range-key block if there is one...

I think we should keep it uncompressed for now, but I think this deserves more thought.

Added a TODO.


sstable/writer.go, line 1048 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I think we need to initialize the rangeKeyBlock field and its restartInterval to 1. This disables key prefix compression. I think we should also update this blockIter comment to include range keys in its description of key stability because we'll require that stability for range keys too:

// A blockIter provides an additional guarantee around key stability when a
// block has a restart interval of 1 (i.e. when there is no prefix
// compression). Key stability refers to whether the InternalKey.UserKey bytes
// returned by a positioning call will remain stable after a subsequent
// positioning call. The normal case is that a positioning call will invalidate
// any previously returned InternalKey.UserKey. If a block has a restart
// interval of 1 (no prefix compression), blockIter guarantees that
// InternalKey.UserKey will point to the key as stored in the block itself
// which will remain valid until the blockIter is closed. The key stability
// guarantee is used by the range tombstone code which knows that range
// tombstones are always encoded with a restart interval of 1. This per-block
// key stability guarantee is sufficient for range tombstones as they are
// always encoded in a single block.

Good catch. Updated.

@nicktrav nicktrav force-pushed the nickt.sstable-rangekeys branch from 428046b to 394e6ba Compare December 13, 2021 19:05
Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:!

Reviewed 1 of 14 files at r1, 3 of 9 files at r4, all commit messages.
Reviewable status: 8 of 15 files reviewed, all discussions resolved (waiting on @jbowens and @sumeerbhola)

Range keys (see cockroachdb#1341) will be stored in their own, single block of an
sstable. Add a new, optional meta block, indexed as "pebble.range_key"
in the metablock index, to the sstable structure. This block is only
present when at least one range key has been written to the sstable.

Add the ability to add range keys to an sstable via
`(*sstable.Writer).Write`.

Update existing data-driven tests to support printing of the range key
summary. Add additional test coverage demonstrating writing of range
keys with an `sstable.Writer`.

Add minimal functionality to `sstable.Reader` to support writing the
data-driven test cases for the writer. Additional read-oriented
functionality will be added in a subsequent patch.

Related to cockroachdb#1339.
@nicktrav nicktrav force-pushed the nickt.sstable-rangekeys branch from 394e6ba to fde3810 Compare December 13, 2021 20:13
@nicktrav
Copy link
Contributor Author

@sumeerbhola and I chatted offline - I'll merge this as-is, and will follow-up on any comments that come up post-merge.

TFTRs!

@nicktrav nicktrav merged commit 989fa6f into cockroachdb:master Dec 13, 2021
@nicktrav nicktrav deleted the nickt.sstable-rangekeys branch December 13, 2021 22:00
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 7 of 15 files reviewed, 1 unresolved discussion


sstable/table.go, line 89 at r5 (raw file):

[index block] (for single level index)
[meta range key block] (optional)
[meta properties block]

The meta properties block is after the rangedel block

nicktrav added a commit to nicktrav/pebble that referenced this pull request Dec 15, 2021
The properties block is written after any range deletion block that may
be present in the sstable.

Fix the docs.

Follow-up from cockroachdb#1400.
Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 7 of 15 files reviewed, 1 unresolved discussion


sstable/table.go, line 89 at r5 (raw file):

Previously, sumeerbhola wrote…

The meta properties block is after the rangedel block

Woops! Thanks for catching. Fix: #1413.

nicktrav added a commit that referenced this pull request Dec 15, 2021
The properties block is written after any range deletion block that may
be present in the sstable.

Fix the docs.

Follow-up from #1400.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants