Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-24.2: storage: GC range keys by unsetting identical suffixes #130940

Merged
merged 1 commit into from
Sep 18, 2024

Conversation

blathers-crl[bot]
Copy link

@blathers-crl blathers-crl bot commented Sep 18, 2024

Backport 1/1 commits from #130572 on behalf of @jbowens.

/cc @cockroachdb/release


In CockroachDB's key encoding some keys have multiple logically equivalent but physically distinct encodings. Most notably, in CockroachDB versions 23.2 and earlier keys written to global tables encoded MVCC timestamps with a 'synthetic bit.' In #101938 CockroachDB stopped encoding and decoding this synthetic bit, transparently ignoring it.

In #129592 we observed the existence of a bug in the CockroachDB comparator when comparing two MVCC timestamp suffixes, specifically outside the context of a full MVCC key. The comparator failed to consider a timestamp with the synthetic bit and a timestamp without the synthetic bit as logically equivalent. There are limited instances where Pebble uses the comparator to compare "bare suffixes," and all instances are constrained to the implementation of range keys.

In #129592 it was observed that the comparator bug could prevent the garbage collection of MVCC delete range tombstones (the single use of range keys within CockroachDB). A cluster running 23.2 or earlier may write a MVCC delete range tombstone with a timestamp encoding the synthetic bit. If the cluster subsequently upgraded to 24.1 or later, the code path to clear range keys stopped understanding synthetic bits and wrote range key unset tombstones without the synthetic bit set. Due to the comparator bug, Pebble did not consider these timestamp suffixes equal and the unset was ineffective.

We initially attempted to fix this issue by fixing the comparator, but inadvertently introduced the possibility of replica divergence #130533 by changing the semantics of LSM state below raft.

This commit works around this comparator bug by adapting ClearMVCCRangeKey to write range key unsets using the verbatim suffix that was read from the storage engine. To avoid reverting #101938 and re-introducing knowledge of the synthetic bit, the MVCCRangeKey data structures are adapted to retain a copy of the encoded timestamp suffix when reading range keys from storage engine iterators. If later an attempt is made to clear the range key through ClearMVCCRangeKey, this encoded timestamp suffix is used instead of re-encoding the timestamp. Through avoiding the decoding/encoding roundtrip, ClearMVCCRangeKey ensures that the suffixes it writes are identical to the range keys that exist on disk, even if they encode a synthetic bit.

Release note (bug fix): Fixes a bug that could result in the inability to garbage collect a MVCC range tombstone within a global table.
Epic: none
Informs #129592.


Release justification: Fixes high-severity issue that could prevent MVCC range tombstones from being garbage collected.

In CockroachDB's key encoding some keys have multiple logically equivalent but
physically distinct encodings. Most notably, in CockroachDB versions 23.2 and
earlier keys written to global tables encoded MVCC timestamps with a
'synthetic bit.' In #101938 CockroachDB stopped encoding and decoding this
synthetic bit, transparently ignoring it.

In #129592 we observed the existence of a bug in the CockroachDB comparator
when comparing two MVCC timestamp suffixes, specifically outside the context of
a full MVCC key. The comparator failed to consider a timestamp with the
synthetic bit and a timestamp without the synthetic bit as logically
equivalent. There are limited instances where Pebble uses the comparator to
compare "bare suffixes," and all instances are constrained to the
implementation of range keys.

In #129592 it was observed that the comparator bug could prevent the garbage
collection of MVCC delete range tombstones (the single use of range keys within
CockroachDB). A cluster running 23.2 or earlier may write a MVCC delete range
tombstone with a timestamp encoding the synthetic bit. If the cluster
subsequently upgraded to 24.1 or later, the code path to clear range keys
stopped understanding synthetic bits and wrote range key unset tombstones
without the synthetic bit set. Due to the comparator bug, Pebble did not
consider these timestamp suffixes equal and the unset was ineffective.

We initially attempted to fix this issue by fixing the comparator, but
inadvertently introduced the possibility of replica divergence #130533 by
changing the semantics of LSM state below raft.

This commit works around this comparator bug by adapting ClearMVCCRangeKey to
write range key unsets using the verbatim suffix that was read from the storage
engine. To avoid reverting #101938 and re-introducing knowledge of the
synthetic bit, the MVCCRangeKey data structures are adapted to retain a copy of
the encoded timestamp suffix when reading range keys from storage engine
iterators. If later an attempt is made to clear the range key through
ClearMVCCRangeKey, this encoded timestamp suffix is used instead of re-encoding
the timestamp. Through avoiding the decoding/encoding roundtrip,
ClearMVCCRangeKey ensures that the suffixes it writes are identical to the
range keys that exist on disk, even if they encode a synthetic bit.

Release note (bug fix): Fixes a bug that could result in the inability to
garbage collect a MVCC range tombstone within a global table.
Epic: none
Informs #129592.
@blathers-crl blathers-crl bot requested a review from a team as a code owner September 18, 2024 14:50
@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-24.2-130572 branch from 76e60fd to b6446e6 Compare September 18, 2024 14:50
@blathers-crl blathers-crl bot requested review from a team as code owners September 18, 2024 14:50
@blathers-crl blathers-crl bot requested a review from jbowens September 18, 2024 14:50
@blathers-crl blathers-crl bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels Sep 18, 2024
@blathers-crl blathers-crl bot requested review from kev-cao, aadityasondhi, nvanbenschoten, RaduBerinde, stevendanna and sumeerbhola and removed request for a team September 18, 2024 14:50
Copy link
Author

blathers-crl bot commented Sep 18, 2024

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Backports should only be created for serious
    issues
    or test-only changes.
  • Backports should not break backwards-compatibility.
  • Backports should change as little code as possible.
  • Backports should not change on-disk formats or node communication protocols.
  • Backports should not add new functionality (except as defined
    here).
  • Backports must not add, edit, or otherwise modify cluster versions; or add version gates.
  • All backports must be reviewed by the owning areas TL. For more information as to how that review should be conducted, please consult the backport
    policy
    .
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters. State changes must be further protected such that nodes running old binaries will not be negatively impacted by the new state (with a mixed version test added).
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.
  • Your backport must be accompanied by a post to the appropriate Slack
    channel (#db-backports-point-releases or #db-backports-XX-X-release) for awareness and discussion.

Also, please add a brief release justification to the body of your PR to justify this
backport.

@blathers-crl blathers-crl bot added the backport Label PR's that are backports to older release branches label Sep 18, 2024
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @jbowens, @kev-cao, @nvanbenschoten, @RaduBerinde, and @stevendanna)

@jbowens
Copy link
Collaborator

jbowens commented Sep 18, 2024

TFTR!

@jbowens jbowens merged commit 8962b7d into release-24.2 Sep 18, 2024
19 of 20 checks passed
@jbowens jbowens deleted the blathers/backport-release-24.2-130572 branch September 18, 2024 16:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport Label PR's that are backports to older release branches blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants