-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: GC range keys by unsetting identical suffixes #130572
Conversation
590a08c
to
2de835f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks very nice and clean! I will take a second more detailed look later today.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @nvanbenschoten, and @sumeerbhola)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
master
still has the comparator fix, so I'd expect the range_keys_clear_synthetic_bit
test to pass regardless of the change (can you verify that?) We should try the patch on release-24.2
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @nvanbenschoten, and @sumeerbhola)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I confirmed that the tests pass on 24.2
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @nvanbenschoten, and @sumeerbhola)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 9 of 9 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @jbowens, and @sumeerbhola)
-- commits
line 40 at r1:
Nice writeup on the history of this issue!
pkg/storage/pebble_iterator.go
line 750 at r1 (raw file):
Timestamp: timestamp, Value: rangeKey.Value, EncodedTimestampSuffix: rangeKey.Suffix,
Will this memory be invalidated when the iterator is advanced?
The reason I had suggested a copySuffix
param in the doc was so that we could conditionally perform a memcpy here, but maybe that isn't needed.
pkg/storage/mvcc_history_test.go
line 1939 at r1 (raw file):
// range key writes will use and we want to exercise it. See #129592. if e.hasArg("syntheticBit") { suffix := storage.EncodeMVCCTimestampSuffixWithSyntheticBitForTesting(rangeKey.Timestamp)
nit: any reason this isn't inside the withWriter
function with the EncodeMVCCValue
call?
pkg/storage/mvcc_test.go
line 5404 at r1 (raw file):
} else { rw := mvccRangeKeyEncodedTimestampReadWriter{engine} // rw := engine
nit: stale comment?
pkg/storage/mvcc_test.go
line 5416 at r1 (raw file):
// mvccRangeKeyEncodedTimestampReadWriter wraps a ReadWriter, overriding // PutMVCCRangeKey so that if MVCCRangeKey.EncodedTimestampSuffix is non-empty, // the resulting range key uses the encoded suffix varbatim. This is a bit of a
s/varbatim/verbatim/
pkg/storage/mvcc_test.go
line 5423 at r1 (raw file):
// TODO(jackson): Remove this when we've guaranteed that all range keys have // timestamps without the synthetic bit. type mvccRangeKeyEncodedTimestampReadWriter struct {
Consider mentioning somewhere in this comment that the production ReadWriter implementations only use EncodedTimestampSuffix
in their ClearMVCCRangeKey
methods, not in their PutMVCCRangeKey
methods.
2de835f
to
3bc6b99
Compare
3bc6b99
to
f5b52d5
Compare
f5b52d5
to
d2f157d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @nvanbenschoten, @stevendanna, and @sumeerbhola)
pkg/storage/pebble_iterator.go
line 750 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Will this memory be invalidated when the iterator is advanced?
The reason I had suggested a
copySuffix
param in the doc was so that we could conditionally perform a memcpy here, but maybe that isn't needed.
It may (whether it was invalidated is dependent on the value of RangeKeyChanged() after the next iterator operation) , but that's true of the entire MVCCRangeKeyStack
structure we return. The bounds user keys will be invalidated, and we'll reuse the p.mvccRangeKeyVersions
slice to hold stack.Versions
too. We rely on the caller Clone
/CloneInto
-ing if necessary
pkg/storage/mvcc_history_test.go
line 1939 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: any reason this isn't inside the
withWriter
function with theEncodeMVCCValue
call?
Done
pkg/storage/mvcc_test.go
line 5423 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Consider mentioning somewhere in this comment that the production ReadWriter implementations only use
EncodedTimestampSuffix
in theirClearMVCCRangeKey
methods, not in theirPutMVCCRangeKey
methods.
Done
I think we need it here too? Is there a way to exercise this code in a test? |
d2f157d
to
de0a22e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @nvanbenschoten, @RaduBerinde, @stevendanna, and @sumeerbhola)
pkg/storage/pebble_batch.go
line 762 at r2 (raw file):
Previously, RaduBerinde wrote…
I think we need it here too? Is there a way to exercise this code in a test?
Really good catch—I added a mvcc history test that exercises this code path metamorphically. I started to look at trying to add a GC unit test that could've exercised this code, but it's a bit tricky to get the synthetic bit set on the test data.
de0a22e
to
0cc0604
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @jbowens, @nvanbenschoten, @stevendanna, and @sumeerbhola)
pkg/storage/pebble_batch.go
line 762 at r2 (raw file):
Previously, jbowens (Jackson Owens) wrote…
Really good catch—I added a mvcc history test that exercises this code path metamorphically. I started to look at trying to add a GC unit test that could've exercised this code, but it's a bit tricky to get the synthetic bit set on the test data.
Great! I audited all calls to EncodeMVCCTimestampSuffic and haven't found anything else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 9 files at r1, 8 of 9 files at r2, 3 of 3 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @jbowens, @nvanbenschoten, and @stevendanna)
pkg/storage/pebble_batch.go
line 222 at r3 (raw file):
} return wb.PutEngineRangeKey( rangeKey.StartKey, rangeKey.EndKey, EncodeMVCCTimestampSuffix(rangeKey.Timestamp), value)
is this also a case where the encoded suffix, if present, could be used?
Was this deliberately omitted since it is used in AddSSTable and export, and we don't want to preserve the synthetic bit in those cases?
SSTWriter
also has this method.
pkg/storage/sst_writer.go
line 270 at r3 (raw file):
// write the tombstone to clear it using the same encoding of the timestamp. // See #129592. if len(rangeKey.EncodedTimestampSuffix) > 0 {
just checking my understanding: is it accurate that we never actually unset rangekeys via writing to a sst?
pkg/kv/kvserver/batcheval/cmd_add_sstable_test.go
line 1269 at r3 (raw file):
v.LocalTimestamp.WallTime *= 1e9 kv.RangeKey.Timestamp.WallTime *= 1e9 kv.RangeKey.EncodedTimestampSuffix = storage.EncodeMVCCTimestampSuffix(kv.RangeKey.Timestamp)
since this is optional, should it randomly choose between setting and not setting, so we don't end up with code that relies on it being set?
pkg/kv/kvserver/batcheval/cmd_delete_range_test.go
line 210 at r3 (raw file):
EndKey: roachpb.Key(tc.end), Timestamp: ts, EncodedTimestampSuffix: storage.EncodeMVCCTimestampSuffix(ts),
ditto
pkg/kv/kvserver/batcheval/knobs_use_range_tombstones_test.go
line 88 at r3 (raw file):
case storage.MVCCRangeKeyValue: kv.RangeKey.Timestamp = hlc.Timestamp{} kv.RangeKey.EncodedTimestampSuffix = nil
why explicitly setting it to nil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @nvanbenschoten, @stevendanna, and @sumeerbhola)
pkg/storage/pebble_batch.go
line 222 at r3 (raw file):
Previously, sumeerbhola wrote…
is this also a case where the encoded suffix, if present, could be used?
Was this deliberately omitted since it is used in AddSSTable and export, and we don't want to preserve the synthetic bit in those cases?
SSTWriter
also has this method.
Yeah, that's right. We deliberately omitted it so that we explicitly avoid propagating the synthetic bit
pkg/storage/sst_writer.go
line 270 at r3 (raw file):
Previously, sumeerbhola wrote…
just checking my understanding: is it accurate that we never actually unset rangekeys via writing to a sst?
yeah, that's right
pkg/kv/kvserver/batcheval/cmd_add_sstable_test.go
line 1269 at r3 (raw file):
Previously, sumeerbhola wrote…
since this is optional, should it randomly choose between setting and not setting, so we don't end up with code that relies on it being set?
unfortunately almost all the tests rely on deep equality of the range key, so we need to maintain the suffix on the expectation or update all the tests
pkg/kv/kvserver/batcheval/knobs_use_range_tombstones_test.go
line 88 at r3 (raw file):
Previously, sumeerbhola wrote…
why explicitly setting it to nil?
To match the above explicit clearing of the Timestamp
(this test does some wonky things)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @jbowens, @nvanbenschoten, and @stevendanna)
pkg/storage/pebble_batch.go
line 222 at r3 (raw file):
Previously, jbowens (Jackson Owens) wrote…
Yeah, that's right. We deliberately omitted it so that we explicitly avoid propagating the synthetic bit
Could you add a code comment pointing this out, else someone might think it is an oversight.
In CockroachDB's key encoding some keys have multiple logically equivalent but physically distinct encodings. Most notably, in CockroachDB versions 23.2 and earlier keys written to global tables encoded MVCC timestamps with a 'synthetic bit.' In cockroachdb#101938 CockroachDB stopped encoding and decoding this synthetic bit, transparently ignoring it. In cockroachdb#129592 we observed the existence of a bug in the CockroachDB comparator when comparing two MVCC timestamp suffixes, specifically outside the context of a full MVCC key. The comparator failed to consider a timestamp with the synthetic bit and a timestamp without the synthetic bit as logically equivalent. There are limited instances where Pebble uses the comparator to compare "bare suffixes," and all instances are constrained to the implementation of range keys. In cockroachdb#129592 it was observed that the comparator bug could prevent the garbage collection of MVCC delete range tombstones (the single use of range keys within CockroachDB). A cluster running 23.2 or earlier may write a MVCC delete range tombstone with a timestamp encoding the synthetic bit. If the cluster subsequently upgraded to 24.1 or later, the code path to clear range keys stopped understanding synthetic bits and wrote range key unset tombstones without the synthetic bit set. Due to the comparator bug, Pebble did not consider these timestamp suffixes equal and the unset was ineffective. We initially attempted to fix this issue by fixing the comparator, but inadvertently introduced the possibility of replica divergence cockroachdb#130533 by changing the semantics of LSM state below raft. This commit works around this comparator bug by adapting ClearMVCCRangeKey to write range key unsets using the verbatim suffix that was read from the storage engine. To avoid reverting cockroachdb#101938 and re-introducing knowledge of the synthetic bit, the MVCCRangeKey data structures are adapted to retain a copy of the encoded timestamp suffix when reading range keys from storage engine iterators. If later an attempt is made to clear the range key through ClearMVCCRangeKey, this encoded timestamp suffix is used instead of re-encoding the timestamp. Through avoiding the decoding/encoding roundtrip, ClearMVCCRangeKey ensures that the suffixes it writes are identical to the range keys that exist on disk, even if they encode a synthetic bit. Release note (bug fix): Fixes a bug that could result in the inability to garbage collect a MVCC range tombstone within a global table. Epic: none Informs cockroachdb#129592.
0cc0604
to
75dcbec
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 9 files at r1, 8 of 9 files at r2, 2 of 3 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi, @jbowens, @nvanbenschoten, @stevendanna, and @sumeerbhola)
TFTRs! bors r+ |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 75dcbec to blathers/backport-release-24.1-130572: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 24.1.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
blathers backport staging-v24.2.2 |
blathers backport release-24.2.3-rc |
In CockroachDB's key encoding some keys have multiple logically equivalent but physically distinct encodings. Most notably, in CockroachDB versions 23.2 and earlier keys written to global tables encoded MVCC timestamps with a 'synthetic bit.' In #101938 CockroachDB stopped encoding and decoding this synthetic bit, transparently ignoring it.
In #129592 we observed the existence of a bug in the CockroachDB comparator when comparing two MVCC timestamp suffixes, specifically outside the context of a full MVCC key. The comparator failed to consider a timestamp with the synthetic bit and a timestamp without the synthetic bit as logically equivalent. There are limited instances where Pebble uses the comparator to compare "bare suffixes," and all instances are constrained to the implementation of range keys.
In #129592 it was observed that the comparator bug could prevent the garbage collection of MVCC delete range tombstones (the single use of range keys within CockroachDB). A cluster running 23.2 or earlier may write a MVCC delete range tombstone with a timestamp encoding the synthetic bit. If the cluster subsequently upgraded to 24.1 or later, the code path to clear range keys stopped understanding synthetic bits and wrote range key unset tombstones without the synthetic bit set. Due to the comparator bug, Pebble did not consider these timestamp suffixes equal and the unset was ineffective.
We initially attempted to fix this issue by fixing the comparator, but inadvertently introduced the possibility of replica divergence #130533 by changing the semantics of LSM state below raft.
This commit works around this comparator bug by adapting ClearMVCCRangeKey to write range key unsets using the verbatim suffix that was read from the storage engine. To avoid reverting #101938 and re-introducing knowledge of the synthetic bit, the MVCCRangeKey data structures are adapted to retain a copy of the encoded timestamp suffix when reading range keys from storage engine iterators. If later an attempt is made to clear the range key through ClearMVCCRangeKey, this encoded timestamp suffix is used instead of re-encoding the timestamp. Through avoiding the decoding/encoding roundtrip, ClearMVCCRangeKey ensures that the suffixes it writes are identical to the range keys that exist on disk, even if they encode a synthetic bit.
Release note (bug fix): Fixes a bug that could result in the inability to garbage collect a MVCC range tombstone within a global table.
Epic: none
Informs #129592.