sstable: add suffix replacement block transform #1314
Conversation
I'd like to flesh out the testing more, probably copying some of the testing from #1233. I figured I'd put it up for an initial review first.
Reviewable status: 0 of 12 files reviewed, all discussions resolved
Force-pushed from dece871 to f90f07d.
Generally looks good.
When configured with a suffix placeholder, the sstable writer requires that all keys' suffixes (as determined by Split) exactly match the provided SuffixPlaceholder.
I like the requirement that all keys match the suffix. Are there potential concerns that this could be too strong a requirement? Can these sstables sometimes contain MVCC-compliant DeleteRange? That should be ok since we encode the timestamp in the same manner.
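The writer-side invariant being discussed here can be sketched as follows. This is a toy illustration, not Pebble's actual writer code: `toySplit` and `checkPlaceholder` are made-up names, and the `'@'`-separator convention stands in for a real Comparer's Split function.

```go
package main

import (
	"bytes"
	"fmt"
)

// toySplit is a stand-in for a Comparer's Split function: the suffix
// begins at the '@' separator, if any. (Made up for illustration.)
func toySplit(k []byte) int {
	if i := bytes.IndexByte(k, '@'); i >= 0 {
		return i
	}
	return len(k)
}

// checkPlaceholder mimics the writer-side invariant: every key's suffix,
// as determined by the split function, must exactly equal the configured
// SuffixPlaceholder.
func checkPlaceholder(key, placeholder []byte, split func([]byte) int) error {
	i := split(key)
	if !bytes.Equal(key[i:], placeholder) {
		return fmt.Errorf("key %q has suffix %q, not the placeholder %q", key, key[i:], placeholder)
	}
	return nil
}

func main() {
	placeholder := []byte("@PLACEHOLDER")
	fmt.Println(checkPlaceholder([]byte("a@PLACEHOLDER"), placeholder, toySplit)) // <nil>
	fmt.Println(checkPlaceholder([]byte("b@1234"), placeholder, toySplit) != nil) // true
}
```

Since the check is purely byte-wise, "too strong" here just means every added key must carry the exact placeholder bytes; an MVCC-encoded DeleteRange whose bounds carry the placeholder suffix would pass the same check.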
Reviewed 8 of 12 files at r1, all commit messages.
Reviewable status: 8 of 12 files reviewed, 4 unresolved discussions (waiting on @dt and @jbowens)
internal/base/internal.go, line 82 at r2 (raw file):
InternalKeyRangeDeleteSentinel = (InternalKeySeqNumMax << 8) | InternalKeyKindRangeDelete

// InternalKeyTrailerLength is the length of the internal key trailer
... is the length in bytes ...
sstable/block.go, line 25 at r2 (raw file):
type blockWriter struct {
	// split, if set, will prevent sharing any portion of the key after
nit: repetition of "after"
sstable/options.go, line 211 at r2 (raw file):
// When a SuffixPlaceholder is set, all added keys that contain a
// prefix must have a suffix exactly equal to the SuffixPlaceholder.
// That is, if Split(k) < len(k), then k[Split(k):] must equal
This "that contain a prefix" qualification is interesting.
Just curious whether we already have a planned use for it.
sstable/reader.go, line 2451 at r2 (raw file):
}
split := len(r.Properties.SuffixReplacement) >> 1
if !r.unreplacedSuffixOK && bytes.Equal(r.Properties.SuffixReplacement[:split], r.Properties.SuffixReplacement[split:]) {
what if the replacement happens to be the same as the original suffix? Is that disallowed since it will fail the check here?
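The check quoted above relies on the SuffixReplacement property value being the placeholder and the replacement concatenated at equal lengths, so the midpoint splits them and equal halves mean the replacement has not happened yet. A minimal sketch of that layout, with hypothetical helper names:

```go
package main

import (
	"bytes"
	"fmt"
)

// splitProperty divides a SuffixReplacement-style property value into its
// placeholder and replacement halves; the two are required to be equal
// length, so the midpoint separates them. (Hypothetical helper names.)
func splitProperty(prop []byte) (placeholder, replacement []byte) {
	half := len(prop) >> 1
	return prop[:half], prop[half:]
}

// unreplaced reports whether ReplaceSuffix has not yet run: immediately
// after writing, the recorded replacement half still equals the placeholder.
func unreplaced(prop []byte) bool {
	ph, repl := splitProperty(prop)
	return bytes.Equal(ph, repl)
}

func main() {
	// Freshly written sstable: replacement half still equals the placeholder.
	fmt.Println(unreplaced([]byte("@PLACE@PLACE"))) // true
	// After ReplaceSuffix overwrote the second half in place.
	fmt.Println(unreplaced([]byte("@PLACE@12345"))) // false
}
```

This encoding is what makes the question above bite: a replacement that happens to equal the placeholder is indistinguishable from an unreplaced sstable.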
Force-pushed from 77cf843 to 52acaa1.
Are there potential concerns that this could be too strong a requirement?
Good question -- Do any of a range's keyspaces contain both keys that require MVCC timestamp replacement and keys that must not undergo timestamp replacement but still contain MVCC timestamps? I believe the answer to be no.
Reviewable status: 3 of 12 files reviewed, 3 unresolved discussions (waiting on @dt and @sumeerbhola)
internal/base/internal.go, line 82 at r2 (raw file):
Previously, sumeerbhola wrote…
... is the length in bytes ...
Done.
sstable/options.go, line 211 at r2 (raw file):
Previously, sumeerbhola wrote…
This "that contain a prefix" qualification is interesting.
Just curious whether we already have a planned use for it.
I didn't have anything planned, but it seemed imaginable that we may one day have timestamp-suffixed and non-timestamp-suffixed keys in the same keyspace (eg, range local), and we'd want to perform replacement on the former but not on the latter.
sstable/reader.go, line 2451 at r2 (raw file):
Previously, sumeerbhola wrote…
what if the replacement happens to be the same as the original suffix? Is that disallowed since it will fail the check here?
Yeah, I figured requiring the placeholder be a unique suffix was okay and worth the lack of ambiguity.
Do we have any requirement that the original suffix actually equal a valid MVCC timestamp encoding? During AddSSTables that fail due to range splits, I think we do read back the constructed sstable in order to split it into the two sides of the range boundary. So I guess we do, and the original timestamp needs to correctly parse as an MVCC timestamp even if it isn't used.
Force-pushed from 52acaa1 to ca7b2f7.
I think that's right: we're adding this feature so that any sst being added writes all its keys at addition time, so I think it's okay to say that every key in it must use the placeholder suffix if it wants to enable suffix replacement.
Yeah, though we could iterate it with something less MVCC-ful if we needed to, but we might as well just make it a valid timestamp. My prototype just used a giant future time, like 0f00 << 62 and logical 1 or something like that, which I figured would be readily recognized if we ever saw it show up un-replaced.
Force-pushed from ca7b2f7 to 942c295.
Looks good overall!
@dt did a benchmark (see internal Slack thread) of his prototype that showed negligible overhead (1-3%) when reading synthesized timestamps. I expect similar overhead here, but can we run a few benchmarks to check?
Do any of a range's keyspaces contain both keys that require MVCC timestamp replacement and keys that must not undergo timestamp replacement but still contain MVCC timestamps? I believe the answer to be no.
I also think that's right. In the common case we have to uphold MVCC invariants, so all writes must be at the current timestamp (thus all keys' timestamps can be replaced). In the exceptional cases where we want to write historical data (e.g. during streaming replication into virgin keyspans), I do not see why we would need timestamp replacement.
I guess the one edge case to be mindful of here is intents, whose interleaved form had MVCC timestamp 0. Although now that we have migrated to separated intents, which use a different timestamp-less key encoding, I don't think it's a problem. I'm also not sure why we would need to write intents in an SSTable that uses timestamp replacement.
In any case, I suppose there's nothing stopping us from relaxing this requirement later, if necessary (?).
we might as well just make it a valid timestamp. my prototype just used a giant future time like 0f00 << 62 and logical 1 or something like that which I figured would be readily recognized if we ever saw it show up un-replaced.
This seems reasonable.
Reviewed 3 of 12 files at r1, 6 of 8 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @dt, @jbowens, and @sumeerbhola)
sstable/reader.go, line 1988 at r6 (raw file):
// concatenated with the replacement value. Values are required to
// be equal length. The SuffixReplacement property was already
// validated in when the reader was constructed.
nit: stray "in"
Do any of a range's keyspaces contain both keys that require MVCC timestamp replacement and keys that must not undergo timestamp replacement but still contain MVCC timestamps? I believe the answer to be no.
Hm, I do see that PutUnversioned writes an MVCCKey with timestamp 0. Does a 0 timestamp matter here?
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @dt, @jbowens, and @sumeerbhola)
In any case, I suppose there's nothing stopping us from relaxing this requirement later, if necessary (?).
Yeah, I think that's right.
Does a 0 timestamp matter here?
I think that's okay. The zero timestamp is serialized as a single zero byte indicating no version. The Split function we use omits that zero byte from the suffix:
Split: func(k []byte) int {
	key, ok := GetKeyPartFromEngineKey(k)
	if !ok {
		return len(k)
	}
	// Pebble requires that keys generated via a split be comparable with
	// normal encoded engine keys. Encoded engine keys have a suffix
	// indicating the number of bytes of version data. Engine keys without a
	// version have a suffix of 0. We're careful in EncodeKey to make sure
	// that the user-key always has a trailing 0. If there is no version this
	// falls out naturally. If there is a version we prepend a 0 to the
	// encoded version data.
	return len(key) + 1
},
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @dt, @jbowens, and @sumeerbhola)
I've started prototyping AddSSTable integration in cockroachdb/cockroach#71658. One thing that's a bit awkward is that the caller (e.g. an IMPORT) first has to construct the SSTable with the SuffixReplacement option and appropriate timestamps, but then also call AddSSTable with the placeholder timestamp as an RPC parameter such that the command can call ReplaceSuffix(..., placeholder, replacement). Instead, ReplaceSuffix could simply read the SuffixReplacement property from the SST, which would obviate the need to pass around duplicate parameters (both as an SST property and as an RPC request parameter).
If we do so, it would be nice if ReplaceSuffix returned a typed error when the SST does not contain a SuffixReplacement property, such that AddSSTable can optimistically try to replace the timestamp and keep going if it's not valid for the SST. Alternatively, we can construct a new reader to check the property first -- I don't know how much overhead that would have.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @dt, @jbowens, and @sumeerbhola)
Reviewed 5 of 8 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @dt, @jbowens, and @sumeerbhola)
sstable/options.go, line 211 at r2 (raw file):
Previously, jbowens (Jackson Owens) wrote…
I didn't have anything planned, but it seemed imaginable that we may one day have timestamp-suffixed and non-timestamp-suffixed keys in the same keyspace (eg, range local), and we'd want to perform replacement on the former but not on the latter.
Based on the discussion, can we strengthen this now to require that when SuffixPlaceholder is set, all keys have that suffix? We could relax it later if needed, but being stricter to start with seems better for correctness.
sstable/reader.go, line 2451 at r2 (raw file):
Previously, jbowens (Jackson Owens) wrote…
Yeah, I figured requiring the placeholder be a unique suffix was okay and worth the lack of ambiguity.
Do we have any requirement that the original suffix actually equal a valid MVCC timestamp encoding? During AddSSTables that fail due to range splits, I think we do read back constructed sstable in order to split it into sides of the range boundary. So I guess we do, and the original timestamp needs to correctly parse as an MVCC timestamp even if it isn't used.
We should say something in the spec in options.go that the original suffix must result in a valid key, and that it should be unused, so that replacement never produces the same suffix.
sstable/reader.go, line 2000 at r6 (raw file):
switch {
case split == len(key.UserKey):
	// This key has no suffix. Skip it.
this would become an error
sstable/writer_test.go, line 168 at r6 (raw file):
require.NoError(t, err)
require.NoError(t, f.Close())
require.NoError(t, ReplaceSuffix(sstableBytes, readOpts, []byte(` world`), []byte(` monde`)))
nice test :)
In CockroachDB, there exist processes that build and ingest sstables. These sstables have timestamp-suffixed MVCC keys. Today, these keys' timestamps are dated in the past and rewrite history. This rewriting violates invariants in parts of the system. We would like to support ingesting these sstables with recent, invariant-maintaining MVCC timestamps. However, ingestion is used during bulk operations, and rewriting large sstables' keys with a recent MVCC timestamp is infeasibly expensive.

This change introduces a facility for constructing an sstable with a placeholder suffix. When using this facility, a caller specifies a SuffixPlaceholder write option. The caller is also required to configure a Comparer that contains a non-nil Split function. When configured with a suffix placeholder, the sstable writer requires that all keys' suffixes (as determined by Split) exactly match the provided SuffixPlaceholder. An sstable constructed in this fashion is still incomplete and unable to be read unless explicitly permitted through the AllowUnreplacedSuffix option.

When a caller would like to complete an sstable constructed with a suffix placeholder, they may call ReplaceSuffix providing the original placeholder value and the replacement value. The placeholder and replacement values are required to be equal lengths. ReplaceSuffix performs an O(1) write to record the replacement value. After a suffix replacement the resulting sstable is complete, and sstable readers may read the sstable. Readers detect the sstable property and apply a block transform to replace suffix placeholders with the replacement value on the fly as blocks are loaded.

Informs cockroachdb/cockroach#70422.
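A toy sketch of the read-time replacement the commit message describes: because the placeholder and replacement are equal length, each key's suffix can be overwritten in place. The real transform walks the block's internal encoding; here a plain slice of keys stands in for a loaded block, and the function name is made up.

```go
package main

import (
	"bytes"
	"fmt"
)

// replaceSuffixes overwrites the placeholder suffix of each key in place
// with the equal-length replacement. Equal lengths are what make this an
// in-place patch rather than a rewrite of the whole block.
func replaceSuffixes(keys [][]byte, placeholder, replacement []byte) {
	for _, k := range keys {
		if bytes.HasSuffix(k, placeholder) {
			copy(k[len(k)-len(placeholder):], replacement)
		}
	}
}

func main() {
	keys := [][]byte{[]byte("a@PH"), []byte("b@PH")}
	replaceSuffixes(keys, []byte("@PH"), []byte("@42"))
	for _, k := range keys {
		fmt.Println(string(k)) // a@42, then b@42
	}
}
```

This is also why ReplaceSuffix itself can be O(1): only the recorded replacement value changes on disk, and the per-key patching is deferred until blocks are loaded.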
Force-pushed from 942c295 to 747770e.
Instead, ReplaceSuffix could simply read the SuffixReplacement property from the SST, which would obviate the need to pass around duplicate parameters (both as an SST property and as an RPC request parameter).
I think the extra bookkeeping might be worth the explicitness in this case.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @dt and @sumeerbhola)
sstable/options.go, line 211 at r2 (raw file):
Previously, sumeerbhola wrote…
Based on the discussion can we strengthen this now to require that when SuffixPlaceholder is set, all keys have that suffix? We could relax it later if needed, but being stricter to start with seems better for correctness.
Done.
Doing this revealed an oversight: we were relying on the Separator and Successor functions to not change the placeholder suffix. I think a reasonable, simple option is to use identity-function Separator and Successor implementations when configured with a SuffixPlaceholder, which I implemented here.
sstable/reader.go, line 2451 at r2 (raw file):
Previously, sumeerbhola wrote…
We should say something in the spec in options.go that the original suffix must result in a valid key, and that it should be unused, so that replacement never produces the same suffix.
Done.
sstable/reader.go, line 2000 at r6 (raw file):
Previously, sumeerbhola wrote…
this would become an error
Done.
sstable/writer_test.go, line 168 at r6 (raw file):
Previously, sumeerbhola wrote…
nice test :)
thanks! :)
Reviewed 4 of 4 files at r7, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @dt, @jbowens, and @sumeerbhola)
sstable/options.go, line 203 at r7 (raw file):
// value.
//
// When a SuffixPlaceholder is set, all added keys must contain a a
nit: extra "a"
sstable/reader.go, line 2000 at r7 (raw file):
switch {
case split == len(key.UserKey):
	panic(errors.New("pebble: suffix replacement sstable contains key without a suffix"))
I didn't expect to see a panic here, since this could happen due to corruption. Return error instead?
sstable/writer.go, line 813 at r7 (raw file):
if len(o.SuffixPlaceholder) > 0 {
	w.separator = func(dst, a, b []byte) []byte { return a }
	w.successor = func(dst, a []byte) []byte { return a }
It would be nice to have a unit test that would have failed because the index was lying about the range of keys in a block.
FWIW, on the cockroach side, I wonder if we can just have a well-known "placeholder timestamp", e.g. defined by pkg/storage, which is what a wrapped SSTWriter uses and is what evalAddSSTable always passes to its
Yeah, I think this Pebble API gives us enough flexibility to shape the KV API more or less as we want. Your suggestion makes sense. We could also e.g. read the property from the SST before passing it to
Reviewed 4 of 4 files at r7, all commit messages.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @dt, @jbowens, and @sumeerbhola)
sstable/properties.go, line 140 at r7 (raw file):
// SuffixReplacement configures sstable readers to apply a block
// transform replacing a configured placeholder key suffix with a
// new suffix of the same length. The property value must be an even
Timestamps appear to use a variable-length encoding (based on the logical and/or synthetic components). We don't know the final timestamp until we're evaluating the AddSSTable request -- which happens way after the SSTable has been generated. So in order to get a fixed-length timestamp we'll have to use the fully expanded 13-byte encoding which explicitly includes all components. The decoder will handle this correctly.
However, this means that we no longer use a unique timestamp encoding. Can this cause any problems for Pebble?
Also, I suppose this means that when we're expanding out these replacements during compaction, we'll end up storing more data than strictly necessary (5 bytes per key in the worst case). Not sure if there's much to do about that.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @dt, @jbowens, and @sumeerbhola)
sstable/properties.go, line 140 at r7 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Timestamps appear to use a variable-length encoding (based on the logical and/or synthetic components). We don't know the final timestamp until we're evaluating the AddSSTable request -- which happens way after the SSTable has been generated. So in order to get a fixed-length timestamp we'll have to use the fully expanded 13-byte encoding which explicitly includes all components. The decoder will handle this correctly. However, this means that we no longer use a unique timestamp encoding. Can this cause any problems for Pebble?
Also, I suppose this means that when we're expanding out these replacements during compaction, we'll end up storing more data than strictly necessary (5 bytes per key in the worst case). Not sure if there's much to do about that.
How about allowing the placeholder and replacement to have different lengths? We could set an upper bound at e.g. 256 bytes and use length-prefixed encoding in a fixed-size buffer. This is based on the possibly naïve assumption that the fixed-length restriction is only to be able to update the replacement suffix in place.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @erikgrinaker, @jbowens, and @sumeerbhola)
sstable/properties.go, line 140 at r7 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
How about allowing the placeholder and replacement to have different lengths? We could set an upper bound at e.g. 256 bytes and use length-prefixed encoding in a fixed-size buffer. This is based on the possibly naïve assumption that the fixed-length restriction is only to be able to update the replacement suffix in place.
I'm not sure I understand what you mean by "we no longer use a unique timestamp encoding"?
I was thinking we're probably not going to feel the extra bytes, since we're adding a bunch of keys with identical bytes, so snappy should be pretty effective on them -- or were you thinking of the block cache?
If we were trying to save space in the block cache, we'd need to re-allocate the smaller block and then shift everything down during replacement, which could mean moving big values around, vs just editing a few key bytes now. I'm dubious it is worth it to sometimes save 4b per key? (nit: it is 4, right? just the potentially extra zero logical? we don't need to leave room for a synthetic byte do we?)
I suppose we could at least remove extra zero logical during compaction, with a custom merge op that runs iff the sst has the replacement prop, so they don't keep taking up space, but I still suspect it isn't worth worrying about?
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @dt, @jbowens, and @sumeerbhola)
sstable/properties.go, line 140 at r7 (raw file):
I'm not sure I understand what you mean by "we no longer use a unique timestamp encoding"?
Previously, any given MVCC timestamp would encode to exactly one binary representation in Pebble. If we have to use fixed-length timestamps with these SST replacements then that is no longer true, as the same MVCC timestamp can have two different representations in Pebble: one variable-length (often 8 bytes) and one fixed-length (13 bytes). This could presumably cause problems with comparisons, where two timestamps that are semantically equal appear different to Pebble. I don't know if this will end up actually mattering, hence the question, but it's the sort of thing it's worth being careful about and I think it would be nice if we could avoid it altogether.
I was thinking we're probably not going to feel the extra bytes since we're adding a bunch of keys with identical bytes, so snappy should be pretty effective on them, or were you thinking in the block cache.
I think the space thing is a secondary concern, and you're probably right that it won't matter all that much post-compression and such. But if we were to allow variable-length replacements then the space savings would be a nice added benefit.
(nit: it is 4, right? just the potentially extra zero logical? we don't need to leave room for a synthetic byte do we?)
I'm not sure if the synthetic byte actually matters at this level, but it is part of the binary encoding so presumably there's a reason for that?
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @erikgrinaker, @jbowens, and @sumeerbhola)
sstable/properties.go, line 140 at r7 (raw file):
Previously, any given MVCC timestamp would encode to exactly one binary representation in Pebble. If we have to use fixed-length timestamps with these SST replacements then that is no longer true, as the same MVCC timestamp can have two different representations in Pebble: one variable-length (often 8 bytes) and one fixed-length (13 bytes). This could presumably cause problems with comparisons, where two timestamps that are semantically equal appear different to Pebble. I don't know if this will end up actually mattering, hence the question, but it's the sort of thing it's worth being careful about and I think it would be nice if we could avoid it altogether.
This is a good catch. It does matter as we do byte comparison on the suffix in a few places. See https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/pebble.go#L109-L119.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @erikgrinaker, @jbowens, and @sumeerbhola)
sstable/properties.go, line 140 at r7 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Previously, any given MVCC timestamp would encode to exactly one binary representation in Pebble. If we have to use fixed-length timestamps with these SST replacements then that is no longer true, as the same MVCC timestamp can have two different representations in Pebble: one variable-length (often 8 bytes) and one fixed-length (13 bytes). This could presumably cause problems with comparisons, where two timestamps that are semantically equal appear different to Pebble. I don't know if this will end up actually mattering, hence the question, but it's the sort of thing it's worth being careful about and I think it would be nice if we could avoid it altogether.
This is a good catch. It does matter as we do bytes comparison on the suffix in a few places. See https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/pebble.go#L109-L119.
Ah, I see. I'd assumed that was ok -- that to pebble these were distinct keys, and only at the higher level cockroach process did we map them to semantically the same timestamp? Where does that go wrong? I suppose if something like GC were to find some expired revisions and then generate a batch of deletions by re-encoding MVCCKey, the batch would be deleting the non-existent implicit zero, and leaving the explicit zero? Are you thinking it is on pebble to canonicalize to the short form?
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @erikgrinaker, @jbowens, and @sumeerbhola)
sstable/properties.go, line 140 at r7 (raw file):
Previously, dt (David Taylor) wrote…
Ah, I see. I'd assumed that was ok -- that to pebble these were distinct keys, and only at the higher level cockroach process did we map them to semantically the same timestamp? Where does that go wrong? I suppose if something like GC were to find some expired revisions and then generate a batch of deletions by re-encoding MVCCKey, the batch would be deleting the non-existent implicit zero, and leaving the explicit zero? Are you thinking it is on pebble to canonicalize to the short form?
What can go wrong? We think two keys are not equal when in fact they are equivalent (same logical MVCC timestamp, but different physical representations). I'm not sure if this will actually matter anywhere, but it would affect DB.Get() if we happened to use that (which we don't). It might affect overwriting or deleting keys. Auditing the existing code to see if this matters will be error prone. We could easily miss something. Requiring Pebble to do the canonicalization seems wrong as Pebble doesn't actually know that these different physical representations are logically equivalent.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @dt, @jbowens, @petermattis, and @sumeerbhola)
sstable/properties.go, line 140 at r7 (raw file):
Previously, petermattis (Peter Mattis) wrote…
What can go wrong? We think two keys are not equal when in fact they are equivalent (same logical MVCC timestamp, but different physical representations). I'm not sure if this will actually matter anywhere, but it would affect DB.Get() if we happened to use that (which we don't). It might affect overwriting or deleting keys. Auditing the existing code to see if this matters will be error prone. We could easily miss something. Requiring Pebble to do the canonicalization seems wrong as Pebble doesn't actually know that these different physical representations are logically equivalent.
Yeah, I really think we should avoid this if we can. Even if it turns out not to matter right now, I'd be surprised if it didn't come back to haunt us later.
Is there a compelling reason why we can't allow variable-length replacements (with an upper bound)? I believe that would avoid this entirely.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @erikgrinaker, @jbowens, and @sumeerbhola)
sstable/properties.go, line 140 at r7 (raw file):
Is there a compelling reason why we can't allow variable-length replacements (with an upper bound)? I believe that would avoid this entirely.
The encoding of a KV in an sstable block is <varint><key-bytes><varint><value-bytes>. So if you change the length of the key you have to rewrite the value and all subsequent KVs in the block. I think Jackson alluded to this above. If we want to change the length of the suffix we'd have to rewrite the block. Perhaps this isn't too bad as we're only doing this when the block is loaded. It just makes the transform more complex.
We also do byte comparisons of timestamps in https://github.com/cockroachdb/cockroach/blob/c13e44998ea0b2f3105b888c9d90c2c364fad123/pkg/storage/pebble.go#L284-L291. I believe what we have is:
The possibility that the same k@ts could be duplicated in an LSM because of two encodings is worrisome. This could also result in replica divergence.
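The length constraint discussed above can be seen directly in the entry shape. A sketch of the simplified <varint><key-bytes><varint><value-bytes> layout, ignoring the prefix compression and restart points real blocks use, showing why a different-length suffix shifts every subsequent byte while a same-length replacement can be patched in place:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeKV appends one <varint len><key><varint len><value> entry, the
// simplified shape of a block entry. (Real blocks also apply prefix
// compression and restart points.)
func encodeKV(dst, key, value []byte) []byte {
	dst = binary.AppendUvarint(dst, uint64(len(key)))
	dst = append(dst, key...)
	dst = binary.AppendUvarint(dst, uint64(len(value)))
	dst = append(dst, value...)
	return dst
}

func main() {
	var short, long []byte
	short = encodeKV(short, []byte("a@1"), []byte("v"))
	short = encodeKV(short, []byte("b@1"), []byte("w"))
	long = encodeKV(long, []byte("a@00001"), []byte("v"))
	long = encodeKV(long, []byte("b@00001"), []byte("w"))

	// Lengthening a key's suffix changes its length varint and shifts
	// every later byte, so a variable-length replacement would force
	// rewriting the rest of the block; an equal-length replacement can
	// overwrite the suffix bytes in place.
	fmt.Println(len(short), len(long)) // 12 20
}
```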
+1
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @erikgrinaker, @jbowens, and @sumeerbhola)
sstable/properties.go, line 140 at r7 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Is there a compelling reason why we can't allow variable-length replacements (with an upper bound)? I believe that would avoid this entirely.
The encoding of a KV in an sstable block is <varint><key-bytes><varint><value-bytes>. So if you change the length of the key you have to rewrite the value and all subsequent KVs in the block. I think Jackson alluded to this above. If we want to change the length of the suffix we'd have to rewrite the block. Perhaps this isn't too bad as we're only doing this when the block is loaded. It just makes the transform more complex.
I don't love having to shuffle the whole block down, particularly if the values are big.
I wonder if we could avoid this complexity elsewhere instead, e.g. by sticking to the same-length replacement rule but requiring cockroach to figure out how to avoid the problems that come from using semantically identical but physically different keys. One idea is that we could demand that AddSSTable requests that request suffix replacement use the full 13-byte batch timestamp under the normal encoding rules, which would imply that they must be invoked with a batch timestamp that has a non-zero logical component. I haven't thought this all the way through, but I think we could just always tick the logical component in the client sending those requests (though I think most of the time today the client doesn't set a timestamp on those at all and the store assigns it instead, so we'd either need it to be in on this or stop doing that, I guess)?
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @dt, @jbowens, @petermattis, and @sumeerbhola)
sstable/properties.go, line 140 at r7 (raw file):
I've been thinking along the same lines, let me chew on it for a bit.
I think we could just always tick the logical in the client sending those requests
The txn's write timestamp might still get pushed to one without a logical component.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @jbowens, @nvanbenschoten, and @sumeerbhola)
sstable/properties.go, line 140 at r7 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I've been thinking along the same lines, let me chew on it for a bit.
I think we could just always tick the logical in the client sending those requests
The txn's write timestamp might still get pushed to one without a logical component.
Huh, so I was asking @nvanbenschoten to help me remember what we could and couldn't assume/enforce about batch timestamps, and he pointed out that if we're making these just like regular writes, then keys written to global_reads spans will have to have synthetic = 1 too, so we're even more variable-length.
Closing this out in favor of write-time replacement (e.g., #1384).
@nvanbenschoten see the discussion starting at #1314 (review) @erikgrinaker @jbowens @dt: Nathan has another case where two different suffix byte slices must compare as equal, and he is planning to change the comparison and equality functions in Pebble (this is for local timestamps cockroachdb/cockroach#77342).
I'd be very careful with this, since it wouldn't surprise me if we just did
We ended up rewriting SSTs at request time, and avoiding it altogether when the request wasn't pushed, which got us to within 5% overhead. That seems like a simpler and more robust solution for our use in any case. But I'll keep this in mind if we should need it sometime (ranged intents with constant-time commits come to mind).
The local timestamps are going into the key because of the following discussion cockroachdb/cockroach#72121 (comment), and it is likely that older version values are going to be in the value sections of an sstable for v22.2.
I see. I realize that I'm late to the party here, but all else being equal, I have a strong preference for keeping this timestamp in the value unless we have a demonstrated significant performance gain. Keys should be unique identifiers, and this timestamp is not an identifier. I'll reiterate my concerns from the previous discussion above.
We already provide Pebble a custom key comparator, so I'm still in the process of confirming that this will have a negligible impact on performance. From a quick glance around, it appears that Pebble's If this ends up being workable then it will be trivial and cheap to normalize a zero logical component to no logical component in the two comparator functions.
This commit fixes the bug revealed in the previous commit and sets the stage for generalized MVCC version normalization during key comparison, which will be needed for cockroachdb#77342.

To do so, the commit adds a normalization step to EngineKeyCompare, the custom `Compare` function we provide to Pebble. This normalization pass currently strips the synthetic bit from version timestamps, which fixes the bug revealed in the previous commit. The normalization pass also strips zero-valued logical components, which are typically not encoded into MVCCKeys today, but may be in the future (for instance, see cockroachdb/pebble#1314). In cockroachdb#77342, we can then extend this to strip the encoded logical timestamp, if present.

In addition to updating the existing custom key comparator function passed to Pebble, the commit also introduces a new custom key equality function. This new function, called EngineKeyEqual, is provided to Pebble as its `Equal` function, replacing the default key equality function of `bytes.Equal`. EngineKeyEqual uses the same version normalization rules to strip portions of the key's version that should not affect ordering.

The relationship between the different comparators is explored in a new property-based unit test called `TestMVCCKeyCompareRandom`. The change allows us to say that for any two `MVCCKeys` `a` and `b`, the following identities hold:

```
a.Compare(b) = EngineKeyCompare(encode(a), encode(b))
a.Equal(b)   = EngineKeyEqual(encode(a), encode(b))
(a.Compare(b) == 0) = a.Equal(b)
(a.Compare(b) < 0)  = a.Less(b)
(a.Compare(b) > 0)  = b.Less(a)
```

Care was taken to minimize the cost of this version normalization. With EngineKeyCompare, the normalization amounts to four new branches that should all be easy for a branch predictor to learn. With EngineKeyEqual, there is more of a concern that this change will regress performance because we switch from a direct call to `bytes.Equal` to a custom comparator. To minimize this cost, the change adds a fast-path to quickly defer to `bytes.Equal` when version normalization is not needed. Benchmarks show that with this fast-path and with an expected distribution of keys, the custom key equality function is about 2.5ns more expensive per call. This seems reasonable.

```
name               time/op
MVCCKeyCompare-10  12.2ns ± 1%
MVCCKeyEqual-10    7.10ns ± 6%
BytesEqual-10      4.72ns ± 2%
```

Release note: None.

Release justification: None. Not intended for v22.1.
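As a rough illustration of the normalization-plus-fast-path idea described in the commit message, here is a simplified sketch. The assumed version layout (8-byte wall time, optional 4-byte logical component, optional 1-byte synthetic marker) only approximates CockroachDB's encoding, and the function names are invented for the example:

```go
package main

import (
	"bytes"
	"fmt"
)

// normalizeVersion strips version components that should not affect ordering:
// a trailing synthetic-marker byte and a zero-valued logical component. This
// is an illustrative layout, not CockroachDB's exact on-disk format.
func normalizeVersion(v []byte) []byte {
	if len(v) == 13 { // wall+logical+synthetic: the synthetic byte never affects order
		v = v[:12]
	}
	if len(v) == 12 && bytes.Equal(v[8:12], []byte{0, 0, 0, 0}) {
		v = v[:8] // a zero logical component encodes the same timestamp
	}
	return v
}

// versionEqual fast-paths to bytes.Equal when neither side needs
// normalization (both are bare 8-byte wall times).
func versionEqual(a, b []byte) bool {
	if len(a) == 8 && len(b) == 8 {
		return bytes.Equal(a, b)
	}
	return bytes.Equal(normalizeVersion(a), normalizeVersion(b))
}

func main() {
	wall := []byte{0, 0, 0, 0, 0, 0, 0, 5}
	withZeroLogical := append(append([]byte{}, wall...), 0, 0, 0, 0)
	fmt.Println(versionEqual(wall, withZeroLogical)) // equal after normalization
}
```

The fast path keeps the common case (two plain wall-time versions) at roughly the cost of a plain `bytes.Equal`, which is the same trade-off the benchmark numbers above are measuring.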
77520: storage: normalize MVCC version timestamps during key comparison r=nvanbenschoten a=nvanbenschoten

Related to #77342.

This commit fixes the bug revealed in the previous commit and sets the stage for generalized MVCC version normalization during key comparison, which will be needed for #77342.

To do so, the commit adds a normalization step to EngineKeyCompare, the custom `Compare` function we provide to Pebble. This normalization pass currently strips the synthetic bit from version timestamps, which fixes the bug revealed in the previous commit. The normalization pass also strips zero-valued logical components, which are typically not encoded into MVCCKeys today, but may be in the future (for instance, see cockroachdb/pebble#1314). In #77342, we can then extend this to strip the encoded logical timestamp, if present.

In addition to updating the existing custom key comparator function passed to Pebble, the commit also introduces a new custom key equality function. This new function, called EngineKeyEqual, is provided to Pebble as its `Equal` function, replacing the default key equality function of `bytes.Equal`. EngineKeyEqual uses the same version normalization rules to strip portions of the key's version that should not affect ordering.

The relationship between the different comparators is explored in a new property-based unit test called `TestMVCCKeyCompareRandom`. The change allows us to say that for any two `MVCCKeys` `a` and `b`, the following identities hold:

```
a.Compare(b) = EngineKeyCompare(encode(a), encode(b))
a.Equal(b)   = EngineKeyEqual(encode(a), encode(b))
(a.Compare(b) == 0) = a.Equal(b)
(a.Compare(b) < 0)  = a.Less(b)
(a.Compare(b) > 0)  = b.Less(a)
```

Care was taken to minimize the cost of this version normalization. With EngineKeyCompare, the normalization amounts to four new branches that should all be easy for a branch predictor to learn. With EngineKeyEqual, there is more of a concern that this change will regress performance because we switch from a direct call to `bytes.Equal` to a custom comparator. To minimize this cost, the change adds a fast-path to quickly defer to `bytes.Equal` when version normalization is not needed. Benchmarks show that with this fast-path and with an expected distribution of keys, the custom key equality function is about 2.5ns more expensive per call. This seems reasonable.

```
name               time/op
MVCCKeyCompare-10  12.2ns ± 1%
MVCCKeyEqual-10    7.10ns ± 6%
BytesEqual-10      4.72ns ± 2%
```

Release note: None.

Release justification: None. Not intended for v22.1.

78350: backupccl: refactor getBackupDetailsAndManifest r=benbardin a=benbardin

This should be a complete no-op from a functionality perspective. The smaller, more encapsulated pieces should be easier to reason about. Specifically, getBackupDetailsAndManifest has now become a scaffolding function, calling the new updateBackupDetails, createBackupManifest, and validateBackupDetailsAndManifest in turn. These functions have a narrower focus and are a bit easier to follow. It's also now a bit clearer where one would go to, say, add more validation conditions.

Release note: None

79091: bazel: upgrade to bazel 5.1.0 r=mari-crl a=rickystewart

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Ben Bardin <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
In CockroachDB, there exist processes that build and ingest sstables.
These sstables have timestamp-suffixed MVCC keys. Today, these keys'
timestamps are dated in the past and rewrite history. This rewriting
violates invariants in parts of the system. We would like to support
ingesting these sstables with recent, invariant-maintaining MVCC
timestamps. However, ingestion is used during bulk operations, and
rewriting large sstables' keys with a recent MVCC timestamp is
infeasibly expensive.
This change introduces a facility for constructing an sstable with a
placeholder suffix. When using this facility, a caller specifies a
SuffixPlaceholder write option. The caller is also required to configure
a Comparer that contains a non-nil Split function. When configured with
a suffix placeholder, the sstable writer requires that all keys'
suffixes (as determined by Split) exactly match the provided
SuffixPlaceholder. An sstable constructed in this fashion is still
incomplete and cannot be read unless reading is explicitly permitted
through the AllowUnreplacedSuffix option.
When a caller would like to complete an sstable constructed with a
suffix placeholder, they may call ReplaceSuffix providing the
original placeholder value and the replacement value. The placeholder
and replacement values are required to be of equal length. ReplaceSuffix
performs an O(1) write to record the replacement value.
After a suffix replacement the resulting sstable is complete, and
sstable readers may read the sstable. Readers detect the sstable
property and apply a block transform to replace suffix placeholders with
the replacement value on the fly as blocks are loaded.
See cockroachdb/cockroach#70422.