storage,kvserver: rangekeys written with a synthetic MVCC timestamp risk never being unset as part of schema gc #129592

Closed
nicktrav opened this issue Aug 23, 2024 · 8 comments
Assignees
Labels
A-kv Anything in KV that doesn't belong in a more specific category. A-storage Relating to our storage engine (Pebble) on-disk storage. branch-master Failures and bugs on the master branch. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 branch-release-24.2.1-rc C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-postmortem Originated from a Postmortem action item. T-storage Storage Team

Comments

@nicktrav

nicktrav commented Aug 23, 2024

When an index or table is dropped, a RANGEKEYSET is placed over the corresponding keyspan as a cheap way of masking it. When the GC period for the index / table has passed, the database writes a RANGEKEYUNSET to "annihilate" the RANGEKEYSET. Given how rangekeys work, the unset will only remove the set if the rangekey suffixes match.
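
To make the "suffixes must match" rule concrete, here is a toy model in Go. It is only a sketch: `toySpan`, `set`, `unset`, and `mustHex` are hypothetical names invented for illustration, not Pebble or CockroachDB APIs, and the model assumes a plain byte-for-byte comparison of suffixes.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// toySpan is a toy model of a single key span: it records the raw suffixes of
// the range keys currently set over it. It is not Pebble's implementation.
type toySpan struct {
	suffixes map[string]struct{}
}

func newToySpan() *toySpan {
	return &toySpan{suffixes: make(map[string]struct{})}
}

// set records a range key with the given suffix.
func (s *toySpan) set(suffix []byte) {
	s.suffixes[string(suffix)] = struct{}{}
}

// unset removes a range key only if a set with a byte-identical suffix exists.
// It reports whether anything was removed.
func (s *toySpan) unset(suffix []byte) bool {
	if _, ok := s.suffixes[string(suffix)]; !ok {
		return false
	}
	delete(s.suffixes, string(suffix))
	return true
}

// mustHex decodes a hex string, panicking on malformed input.
func mustHex(h string) []byte {
	b, err := hex.DecodeString(h)
	if err != nil {
		panic(err)
	}
	return b
}

func main() {
	s := newToySpan()
	s.set(mustHex("17d1905bc26af86300000002010e"))
	fmt.Println("mismatched unset removed:", s.unset(mustHex("17d1905bc26af863000000020d")))
	fmt.Println("identical unset removed:", s.unset(mustHex("17d1905bc26af86300000002010e")))
}
```

In this model, an unset whose suffix differs by even one byte leaves the set in place.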

CRDB places the MVCC timestamp in the rangekey suffix. In prior versions of CRDB, the MVCC timestamp could be synthetic. Thus, a RANGEKEYSET could contain a suffix consisting of a synthetic MVCC timestamp. Synthetic timestamps have since been deprecated and removed, see #101938.

However, a schema GC operation in a newer version, which has no notion of synthetic timestamps, never generates a synthetic timestamp when it queries the rangekeys to GC. Thus, any RANGEKEYSET written with a synthetic timestamp will never be GC'd: while the RANGEKEYUNSETs written will have corresponding spans, the suffixes will never be equal due to the lack of the synthetic bit.

For example, consider a RANGEKEYSET for a span:

RANGEKEYSET: /Table/2710/{NULL-2/NULL} - 17d1905bc26af86300000002010e

Note the trailing 010e on the timestamp: 01 is the synthetic byte and 0e is the length byte (14).

The corresponding RANGEKEYUNSET generated as part of GC will be:

RANGEKEYUNSET: /Table/271{0/2/NULL-1/NULL} - 17d1905bc26af863000000020d

The latter has a timestamp that is not synthetic.
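
The two suffixes above can be pulled apart with a few lines of Go. This is a sketch under an assumed layout inferred from the hex values in this issue (8-byte big-endian wall time, 4-byte big-endian logical, optional 1-byte synthetic flag, trailing length byte); `suffixParts`, `parseSuffix`, and `mustHex` are hypothetical helpers, not part of the `storage` package, and only the 13- and 14-byte forms are handled.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// suffixParts is a hypothetical breakdown of an encoded MVCC timestamp suffix.
type suffixParts struct {
	WallTime  uint64
	Logical   uint32
	Synthetic bool
	Length    byte
}

// parseSuffix splits a 13- or 14-byte suffix into its assumed components:
// wall time (8 bytes), logical (4 bytes), an optional synthetic byte, and a
// trailing length byte.
func parseSuffix(b []byte) (suffixParts, error) {
	if len(b) != 13 && len(b) != 14 {
		return suffixParts{}, fmt.Errorf("unsupported suffix length %d", len(b))
	}
	p := suffixParts{
		WallTime: binary.BigEndian.Uint64(b[0:8]),
		Logical:  binary.BigEndian.Uint32(b[8:12]),
		Length:   b[len(b)-1],
	}
	if len(b) == 14 {
		p.Synthetic = b[12] != 0
	}
	return p, nil
}

// mustHex decodes a hex string, panicking on malformed input.
func mustHex(h string) []byte {
	b, err := hex.DecodeString(h)
	if err != nil {
		panic(err)
	}
	return b
}

func main() {
	set := mustHex("17d1905bc26af86300000002010e")
	unset := mustHex("17d1905bc26af863000000020d")
	for name, raw := range map[string][]byte{"set": set, "unset": unset} {
		p, err := parseSuffix(raw)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %+v\n", name, p)
	}
	fmt.Println("byte-identical:", bytes.Equal(set, unset))
}
```

Both suffixes carry the same wall time and logical component; they differ only in the synthetic byte and, as a consequence, the length byte.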

The DB would ideally be defensive when encountering these old synthetic timestamps as part of GC (and possibly more generally).

Jira issue: CRDB-41605

@nicktrav nicktrav added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-storage Relating to our storage engine (Pebble) on-disk storage. A-kv Anything in KV that doesn't belong in a more specific category. branch-master Failures and bugs on the master branch. T-kv KV Team branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 labels Aug 23, 2024
@blathers-crl blathers-crl bot added the T-storage Storage Team label Aug 23, 2024
@nicktrav

I narrowed this down some more. The problem looks to be an issue with the codec for the MVCC timestamp now being "lossy".

When we scan through Pebble looking for range keys, the iterator implementations all boil down to a pebbleIterator, which has RangeKeys and EngineRangeKeys methods, both of which call (*pebble.Iterator).RangeKeys. This gives us the raw Pebble range keys: a start key, and a value, which unpacks to the end key and a suffix. For MVCC keys, the timestamp is in the suffix.

The suffix that we read from the Pebble store is unadulterated, and is an (old) synthetic timestamp. E.g. 17d1905bc26af86300000002010e (note the \x01 synthetic byte after the logical time, 2, followed by the length byte \x0e).

For the (*pebbleIterator).EngineRangeKeys method, we don't muck with the value, and everything is ok.

However, (*pebbleIterator).RangeKeys decodes the timestamp through DecodeMVCCTimestampSuffix. And as part of #101938, we stopped putting the synthetic bit in the hlc.Timestamp struct (the field was deprecated in 62ebe56, and we stopped writing to and reading from that field in 3773994 and f05520f, respectively). Therefore, we'll lose the synthetic bit when converting from Pebble's representation to the MVCC representation.

This is a problem for rangekeys, because when we write the RANGEKEYUNSET down in ClearMVCCRangeKey, we call EncodeMVCCTimestampSuffix to generate the suffix, which no longer has knowledge of the synthetic bit, and thus won't write it. We end up with a suffix like 17d1905bc26af863000000020d, which then means this RANGEKEYUNSET cannot wipe out the RANGEKEYSET it was supposed to, because the suffixes don't match.

Here's a test that demonstrates this loss:

package storage_test

import (
	"encoding/hex"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/storage"
	"github.com/stretchr/testify/require"
)

func TestTimestampRoundTrip(t *testing.T) {
	const tsHexIn = "17d1905bc26af86300000002010e" // Walltime + Logical + Synthetic
	b, err := hex.DecodeString(tsHexIn)
	require.NoError(t, err)

	// From pebble representation.
	// NOTE: We lose the synthetic bit here.
	ts, err := storage.DecodeMVCCTimestampSuffix(b)
	require.NoError(t, err)

	// Back to pebble representation.
	b = storage.EncodeMVCCTimestampSuffix(ts)
	tsHexOut := hex.EncodeToString(b)

	// Assert we didn't lose anything along the way.
	require.Equal(t, tsHexIn, tsHexOut) // Fails!
}
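
For readers without a CockroachDB checkout, the same loss can be reproduced with a standalone sketch. The `timestamp` type and the `decodeSuffix`, `encodeSuffix`, and `roundTrip` functions below are hypothetical stand-ins, not the real `storage` functions. They assume the layout described above and always write the 12-byte wall time plus logical form, which is a simplification.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// timestamp mirrors the situation after the synthetic field was dropped:
// there is nowhere to keep the synthetic bit.
type timestamp struct {
	WallTime int64
	Logical  int32
}

// decodeSuffix is a stand-in decoder. It accepts the 13-byte form and the
// 14-byte (synthetic) form, and silently ignores the synthetic byte.
func decodeSuffix(b []byte) (timestamp, error) {
	if len(b) != 13 && len(b) != 14 {
		return timestamp{}, fmt.Errorf("unsupported suffix length %d", len(b))
	}
	return timestamp{
		WallTime: int64(binary.BigEndian.Uint64(b[0:8])),
		Logical:  int32(binary.BigEndian.Uint32(b[8:12])),
	}, nil
}

// encodeSuffix is a stand-in encoder. It has no synthetic bit to write, so it
// always emits the 13-byte form: wall time, logical, length byte.
func encodeSuffix(ts timestamp) []byte {
	b := make([]byte, 13)
	binary.BigEndian.PutUint64(b[0:8], uint64(ts.WallTime))
	binary.BigEndian.PutUint32(b[8:12], uint32(ts.Logical))
	b[12] = 13
	return b
}

// roundTrip decodes a hex-encoded suffix and re-encodes it.
func roundTrip(hexIn string) (string, error) {
	raw, err := hex.DecodeString(hexIn)
	if err != nil {
		return "", err
	}
	ts, err := decodeSuffix(raw)
	if err != nil {
		return "", err
	}
	return hex.EncodeToString(encodeSuffix(ts)), nil
}

func main() {
	in := "17d1905bc26af86300000002010e"
	out, err := roundTrip(in)
	if err != nil {
		panic(err)
	}
	fmt.Println("in: ", in)
	fmt.Println("out:", out)
	fmt.Println("lossless:", in == out)
}
```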

@nicktrav

It looks like we then fixed all of this when we fixed the comparators recently, in #128043.

So ... we should be good here?

@nicktrav

nicktrav commented Aug 24, 2024

I'm going to close this out, as we have the fix in the following release branches:
- 23.2: #128818
- 24.1: #128817
- 24.2: #128816

Edit: The backports above were not actually addressing the comparator issue. See below.

@github-project-automation github-project-automation bot moved this from Incoming to Done in [Deprecated] Storage Aug 24, 2024
@RaduBerinde RaduBerinde reopened this Aug 24, 2024
@github-project-automation github-project-automation bot moved this from Done to Incoming in [Deprecated] Storage Aug 24, 2024
@RaduBerinde

The backports above did not address the issue, reopening to track backporting the comparator fixes.


blathers-crl bot commented Aug 27, 2024

Based on the specified backports for linked PR #129605, I applied the following new label(s) to this issue: branch-release-23.1, branch-release-23.2, branch-release-24.2.1-rc. Please adjust the labels as needed to match the branches actually affected by this issue, including adding any known older branches.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@RaduBerinde

Reopening since the fixes were backed out.

jbowens added a commit to jbowens/cockroach that referenced this issue Sep 12, 2024
In CockroachDB's key encoding some keys have multiple logically equivalent but
physically distinct encodings. Most notably, in CockroachDB versions 23.2 and
earlier keys written to global tables encoded MVCC timestamps with a
'synthetic bit.' In cockroachdb#101938 CockroachDB stopped encoding and decoding this
synthetic bit, transparently ignoring it.

In cockroachdb#129592 we observed the existence of a bug in the CockroachDB comparator
when comparing two MVCC timestamp suffixes, specifically outside the context of
a full MVCC key. The comparator failed to consider a timestamp with the
synthetic bit and a timestamp without the synthetic bit as logically
equivalent. There are limited instances where Pebble uses the comparator to
compare "bare suffixes," and all instances are constrained to the
implementation of range keys.
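
To illustrate what "logically equivalent" means for bare suffixes, the sketch below compares two suffixes after stripping the length byte and, for the 14-byte form, the synthetic byte. `normalize` and `logicallyEqual` are hypothetical helpers that assume the suffix layout seen in cockroachdb#129592. This is not the CockroachDB comparator, and it is not the fix this commit adopts.

```go
package main

import (
	"bytes"
	"encoding/hex"
	"fmt"
)

// normalize returns the timestamp portion of a suffix: it drops the trailing
// length byte and, for the 14-byte form, the synthetic byte as well.
func normalize(suffix []byte) []byte {
	switch len(suffix) {
	case 0:
		return suffix
	case 14:
		return suffix[:12]
	default:
		return suffix[:len(suffix)-1]
	}
}

// logicallyEqual reports whether two suffixes encode the same wall time and
// logical component, ignoring the synthetic bit.
func logicallyEqual(a, b []byte) bool {
	return bytes.Equal(normalize(a), normalize(b))
}

// mustHex decodes a hex string, panicking on malformed input.
func mustHex(h string) []byte {
	b, err := hex.DecodeString(h)
	if err != nil {
		panic(err)
	}
	return b
}

func main() {
	withSynthetic := mustHex("17d1905bc26af86300000002010e")
	without := mustHex("17d1905bc26af863000000020d")
	fmt.Println("byte-equal:", bytes.Equal(withSynthetic, without))
	fmt.Println("logically equal:", logicallyEqual(withSynthetic, without))
}
```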

In cockroachdb#129592 it was observed that the comparator bug could prevent the garbage
collection of MVCC delete range tombstones (the single use of range keys within
CockroachDB). A cluster running 23.2 or earlier may write a MVCC delete range
tombstone with a timestamp encoding the synthetic bit. If the cluster
subsequently upgraded to 24.1 or later, the code path to clear range keys
stopped understanding synthetic bits and wrote range key unset tombstones
without the synthetic bit set. Due to the comparator bug, Pebble did not
consider these timestamp suffixes equal and the unset was ineffective.

We initially attempted to fix this issue by fixing the comparator, but
inadvertently introduced the possibility of replica divergence cockroachdb#130533 by
changing the semantics of LSM state below raft.

This commit works around this comparator bug by adapting ClearMVCCRangeKey to
write range key unsets using the verbatim suffix that was read from the storage
engine. To avoid reverting cockroachdb#101938 and re-introducing knowledge of the
synthetic bit, the MVCCRangeKey data structures are adapted to retain a copy of
the encoded timestamp suffix when reading range keys from storage engine
iterators. If later an attempt is made to clear the range key through
ClearMVCCRangeKey, this encoded timestamp suffix is used instead of re-encoding
the timestamp. Through avoiding the decoding/encoding roundtrip,
ClearMVCCRangeKey ensures that the suffixes it writes are identical to the
range keys that exist on disk, even if they encode a synthetic bit.
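
A minimal sketch of the idea, assuming a simplified data model: `rangeKeyVersion`, `readVersion`, `suffixForUnset`, and the stand-in `encodeSuffix` are hypothetical names for illustration, not the actual MVCCRangeKey types or ClearMVCCRangeKey code. The point is only that the verbatim suffix, when present, takes precedence over re-encoding.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// timestamp has no synthetic field, as after cockroachdb#101938.
type timestamp struct {
	WallTime int64
	Logical  int32
}

// rangeKeyVersion is a hypothetical stand-in for a range key version. When it
// is read from storage, EncodedSuffix keeps a verbatim copy of the on-disk
// suffix; it is nil for versions constructed in memory.
type rangeKeyVersion struct {
	Timestamp     timestamp
	EncodedSuffix []byte
}

// encodeSuffix is a stand-in encoder that cannot write a synthetic bit.
func encodeSuffix(ts timestamp) []byte {
	b := make([]byte, 13)
	binary.BigEndian.PutUint64(b[0:8], uint64(ts.WallTime))
	binary.BigEndian.PutUint32(b[8:12], uint32(ts.Logical))
	b[12] = 13
	return b
}

// readVersion decodes a raw suffix and retains a copy of the original bytes.
func readVersion(raw []byte) (rangeKeyVersion, error) {
	if len(raw) != 13 && len(raw) != 14 {
		return rangeKeyVersion{}, fmt.Errorf("unsupported suffix length %d", len(raw))
	}
	return rangeKeyVersion{
		Timestamp: timestamp{
			WallTime: int64(binary.BigEndian.Uint64(raw[0:8])),
			Logical:  int32(binary.BigEndian.Uint32(raw[8:12])),
		},
		EncodedSuffix: append([]byte(nil), raw...),
	}, nil
}

// suffixForUnset picks the suffix to write in a range key unset: the verbatim
// bytes if they were retained, otherwise a fresh encoding of the timestamp.
func suffixForUnset(v rangeKeyVersion) []byte {
	if len(v.EncodedSuffix) > 0 {
		return v.EncodedSuffix
	}
	return encodeSuffix(v.Timestamp)
}

func main() {
	raw, err := hex.DecodeString("17d1905bc26af86300000002010e")
	if err != nil {
		panic(err)
	}
	v, err := readVersion(raw)
	if err != nil {
		panic(err)
	}
	unset := suffixForUnset(v)
	fmt.Println("unset suffix:", hex.EncodeToString(unset))
	fmt.Println("matches on-disk suffix:", bytes.Equal(unset, raw))
}
```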

Release note (bug fix): Fixes a bug that could result in the inability to
garbage collect a MVCC range tombstone within a global table.
Epic: none
Informs cockroachdb#129592.
jbowens added a commit to jbowens/cockroach that referenced this issue Sep 13, 2024
jbowens added a commit to jbowens/cockroach that referenced this issue Sep 17, 2024
@nicktrav nicktrav moved this from Incoming to In Progress (this milestone) in [Deprecated] Storage Sep 17, 2024
craig bot pushed a commit that referenced this issue Sep 18, 2024
130453: logictest: revert incorrect test assertion update r=rafiss a=michae2

(Deja vu: this is #121556 all over again.)

103bd54 incorrectly updated the test expectations, likely because the `--rewrite` flag was used on an assertion that has the retry directive.

This commit undoes that change.

Fixes: #130405

Release note: None

130572: storage: GC range keys by unsetting identical suffixes r=jbowens a=jbowens

In CockroachDB's key encoding some keys have multiple logically equivalent but physically distinct encodings. Most notably, in CockroachDB versions 23.2 and earlier keys written to global tables encoded MVCC timestamps with a 'synthetic bit.' In #101938 CockroachDB stopped encoding and decoding this synthetic bit, transparently ignoring it.

In #129592 we observed the existence of a bug in the CockroachDB comparator when comparing two MVCC timestamp suffixes, specifically outside the context of a full MVCC key. The comparator failed to consider a timestamp with the synthetic bit and a timestamp without the synthetic bit as logically equivalent. There are limited instances where Pebble uses the comparator to compare "bare suffixes," and all instances are constrained to the implementation of range keys.

In #129592 it was observed that the comparator bug could prevent the garbage collection of MVCC delete range tombstones (the single use of range keys within CockroachDB). A cluster running 23.2 or earlier may write a MVCC delete range tombstone with a timestamp encoding the synthetic bit. If the cluster subsequently upgraded to 24.1 or later, the code path to clear range keys stopped understanding synthetic bits and wrote range key unset tombstones without the synthetic bit set. Due to the comparator bug, Pebble did not consider these timestamp suffixes equal and the unset was ineffective.

We initially attempted to fix this issue by fixing the comparator, but inadvertently introduced the possibility of replica divergence #130533 by changing the semantics of LSM state below raft.

This commit works around this comparator bug by adapting ClearMVCCRangeKey to write range key unsets using the verbatim suffix that was read from the storage engine. To avoid reverting #101938 and re-introducing knowledge of the synthetic bit, the MVCCRangeKey data structures are adapted to retain a copy of the encoded timestamp suffix when reading range keys from storage engine iterators. If later an attempt is made to clear the range key through ClearMVCCRangeKey, this encoded timestamp suffix is used instead of re-encoding the timestamp. Through avoiding the decoding/encoding roundtrip, ClearMVCCRangeKey ensures that the suffixes it writes are identical to the range keys that exist on disk, even if they encode a synthetic bit.

Release note (bug fix): Fixes a bug that could result in the inability to garbage collect a MVCC range tombstone within a global table.
Epic: none
Informs #129592.

130906: sql: deflake TestValidationWithProtectedTS r=rafiss a=rafiss

This test does not work if ranges get split, so we disable the split queue.

fixes #130715
Release note: None

Co-authored-by: Michael Erickson <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Sep 18, 2024
blathers-crl bot pushed a commit that referenced this issue Sep 18, 2024
blathers-crl bot pushed a commit that referenced this issue Sep 18, 2024
@Schtick Schtick added the O-postmortem Originated from a Postmortem action item. label Sep 18, 2024
jbowens added a commit that referenced this issue Sep 18, 2024
@jbowens

jbowens commented Sep 18, 2024

Fixed on 24.1, 24.2, 24.2.2 staging, 24.2.3-rc and master.

@jbowens jbowens closed this as completed Sep 18, 2024
@github-project-automation github-project-automation bot moved this from In Progress (this milestone) to Done in [Deprecated] Storage Sep 18, 2024