storage: replica inconsistency after upgrade to v24.2.1 #130533

Closed
RaduBerinde opened this issue Sep 11, 2024 · 5 comments
Assignees
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. branch-master Failures and bugs on the master branch. branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-postmortem Originated from a Postmortem action item. P-2 Issues/test failures with a fix SLA of 3 months regression Regression from a release. T-storage Storage Team

Comments

@RaduBerinde
Member

RaduBerinde commented Sep 11, 2024

Background

In older releases, we used to (in some cases) append an extra synthetic indicator bit to timestamps. These timestamps were deprecated in v24.1 (#101938), but they can still persist in existing KVs.

Two timestamps that differ only in this indicator bit are supposed to compare as equal, so the engine key comparer (passed to Pebble) is meant to ignore the bit when comparing timestamps. We recently found a bug in the engine key comparer implementation (#127914): while it correctly ignores the synthetic bit when comparing two full keys with timestamps, it does not ignore it when only the timestamps themselves are compared. The latter happens only in the context of range keys. In particular, when unsetting a range key, the unset is effective only if the timestamps match. If the range key was set with the synthetic bit, an Unset issued against it by a recent version (which no longer writes the bit) would be ineffective. Not long after discovering this, a production cluster hit this issue (#129592).
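
To make the mismatch concrete, here is a minimal Go sketch. It assumes a toy suffix layout (8 bytes of wall time, optionally followed by one marker byte), which is not the real MVCC timestamp encoding, and all function names are hypothetical. It only illustrates why a byte-for-byte comparison of bare suffixes fails to pair an Unset with a Set that carries the synthetic marker.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// wallLen is the length of the toy timestamp (assumption: wall time only).
const wallLen = 8

// encodeSuffix builds a toy timestamp suffix, optionally appending a
// one-byte synthetic marker the way older releases might have.
func encodeSuffix(wall uint64, synthetic bool) []byte {
	b := make([]byte, wallLen, wallLen+1)
	binary.BigEndian.PutUint64(b, wall)
	if synthetic {
		b = append(b, 1)
	}
	return b
}

// normalize strips the trailing synthetic marker, if present.
func normalize(s []byte) []byte {
	if len(s) == wallLen+1 {
		return s[:wallLen]
	}
	return s
}

// equalStrict compares raw bytes, mirroring the bare-suffix behavior
// described above.
func equalStrict(a, b []byte) bool { return bytes.Equal(a, b) }

// equalLenient ignores the synthetic marker, mirroring how full keys
// are compared.
func equalLenient(a, b []byte) bool {
	return bytes.Equal(normalize(a), normalize(b))
}

func main() {
	set := encodeSuffix(42, true)    // written by an old release
	unset := encodeSuffix(42, false) // written by a recent release
	fmt.Println("strict match:", equalStrict(set, unset))
	fmt.Println("lenient match:", equalLenient(set, unset))
}
```

Under the strict comparison the two suffixes do not match, so in this model the Unset would leave the Set in place.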

Fix that went into v24.2.1

In v24.2.1, to address #129592, we merged a fix to the comparer (#129605). Unfortunately, we have since found that this fix can cause replica inconsistency (which causes nodes to crash) once any node is upgraded.

A detailed sequence (aptly described by @jbowens):

  • A 23.2 cluster writes an MVCC range tombstone whose timestamp has the synthetic bit.
  • The cluster upgrades to 24.1.x (and perhaps 24.2.0).
  • MVCC GC runs, writing an ineffectual RANGEKEYUNSET to remove the MVCC range tombstone.
  • Node n1 compacts the RANGEKEYUNSET into L6, eliding it. Note that all replicas are still consistent.
  • Node n2 upgrades to v24.2.1. The n2 LSM still contains the relevant RANGEKEYUNSET because it has not been compacted into L6 yet. The replica on n2 has now diverged from n1: with the fixed comparer the RANGEKEYUNSET takes effect, so n2 no longer considers the MVCC range tombstone to be live, whereas n1 still considers the RANGEKEYSET live.
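
The sequence above can be replayed with a toy model. The Go sketch below is only an illustration under stated assumptions: the `replica` type, its methods, and the 8-byte-plus-marker suffix layout are all hypothetical and much simpler than a real LSM. It models one Set with the marker and one Unset without it, then applies "compact on n1, upgrade n2" in that order.

```go
package main

import (
	"bytes"
	"fmt"
)

// replica is a toy model of one node's LSM state for a single span.
type replica struct {
	sets    [][]byte // suffixes of RANGEKEYSET entries
	unsets  [][]byte // suffixes of RANGEKEYUNSET entries not yet elided
	lenient bool     // true once the node runs the "fixed" comparer
}

// strip drops the trailing marker byte of the toy encoding.
func strip(s []byte) []byte {
	if len(s) == 9 {
		return s[:8]
	}
	return s
}

// match reports whether an unset suffix applies to a set suffix under
// the comparer this replica is currently running.
func (r *replica) match(set, unset []byte) bool {
	if r.lenient {
		return bytes.Equal(strip(set), strip(unset))
	}
	return bytes.Equal(set, unset)
}

// liveSets returns the sets that no pending unset removes.
func (r *replica) liveSets() [][]byte {
	var live [][]byte
	for _, s := range r.sets {
		shadowed := false
		for _, u := range r.unsets {
			if r.match(s, u) {
				shadowed = true
				break
			}
		}
		if !shadowed {
			live = append(live, s)
		}
	}
	return live
}

// compactToBottom applies unsets with the current comparer and then
// elides them, as a compaction into the bottom level would.
func (r *replica) compactToBottom() {
	r.sets = r.liveSets()
	r.unsets = nil
}

// newReplica starts with a legacy set and an ineffectual unset.
func newReplica() *replica {
	plain := []byte{0, 0, 0, 0, 0, 0, 0, 42}
	legacy := append(append([]byte{}, plain...), 1)
	return &replica{sets: [][]byte{legacy}, unsets: [][]byte{plain}}
}

func main() {
	n1, n2 := newReplica(), newReplica()
	n1.compactToBottom() // old comparer: the unset is elided, the set stays
	n2.lenient = true    // n2 upgrades while its unset is still present
	fmt.Println("n1 live range keys:", len(n1.liveSets()))
	fmt.Println("n2 live range keys:", len(n2.liveSets()))
}
```

In this model n1 ends with one live range key and n2 with none, and upgrading n1 afterwards does not help because its Unset no longer exists.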

The comparer fix was backed out from all branches and #129592 was reopened. v24.2.1 is the only released version with that change.

CC @miraradeva @nvanbenschoten @nicktrav

Jira issue: CRDB-42109

@RaduBerinde RaduBerinde added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. regression Regression from a release. labels Sep 11, 2024

blathers-crl bot commented Sep 11, 2024

Hi @RaduBerinde, please add branch-* labels to identify which branch(es) this C-bug affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@RaduBerinde RaduBerinde added branch-master Failures and bugs on the master branch. branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 labels Sep 11, 2024
@RaduBerinde
Member Author

Note the reopened issue #129592 will be used to track an alternate fix for that issue. This issue will track undoing the fix on master and working out any guidance we need to provide for clusters already upgraded to v24.2.1.

@RaduBerinde
Member Author

The condition for a synthetic timestamp to exist in a cluster is whether a global table has been used. Any write to a global table (including the rangedel to clear the table when dropped) will get a synthetic bit in versions <= 23.2. (thanks @nvanbenschoten)

jbowens added a commit to jbowens/cockroach that referenced this issue Sep 12, 2024
In CockroachDB's key encoding some keys have multiple logically equivalent but
physically distinct encodings. Most notably, in CockroachDB versions 23.2 and
earlier keys written to global tables encoded MVCC timestamps with a
'synthetic bit.' In cockroachdb#101938 CockroachDB stopped encoding and decoding this
synthetic bit, transparently ignoring it.

In cockroachdb#129592 we observed the existence of a bug in the CockroachDB comparator
when comparing two MVCC timestamp suffixes, specifically outside the context of
a full MVCC key. The comparator failed to consider a timestamp with the
synthetic bit and a timestamp without the synthetic bit as logically
equivalent. There are limited instances where Pebble uses the comparator to
compare "bare suffixes," and all instances are constrained to the
implementation of range keys.

In cockroachdb#129592 it was observed that the comparator bug could prevent the garbage
collection of MVCC delete range tombstones (the single use of range keys within
CockroachDB). A cluster running 23.2 or earlier may write a MVCC delete range
tombstone with a timestamp encoding the synthetic bit. If the cluster
subsequently upgraded to 24.1 or later, the code path to clear range keys
stopped understanding synthetic bits and wrote range key unset tombstones
without the synthetic bit set. Due to the comparator bug, Pebble did not
consider these timestamp suffixes equal and the unset was ineffective.

We initially attempted to fix this issue by fixing the comparator, but
inadvertently introduced the possibility of replica divergence cockroachdb#130533 by
changing the semantics of LSM state below raft.

This commit works around this comparator bug by adapting ClearMVCCRangeKey to
write range key unsets using the verbatim suffix that was read from the storage
engine. To avoid reverting cockroachdb#101938 and re-introducing knowledge of the
synthetic bit, the MVCCRangeKey data structures are adapted to retain a copy of
the encoded timestamp suffix when reading range keys from storage engine
iterators. If later an attempt is made to clear the range key through
ClearMVCCRangeKey, this encoded timestamp suffix is used instead of re-encoding
the timestamp. Through avoiding the decoding/encoding roundtrip,
ClearMVCCRangeKey ensures that the suffixes it writes are identical to the
range keys that exist on disk, even if they encode a synthetic bit.

Release note (bug fix): Fixes a bug that could result in the inability to
garbage collect a MVCC range tombstone within a global table.
Epic: none
Informs cockroachdb#129592.
jbowens added a commit to jbowens/cockroach that referenced this issue Sep 12, 2024
jbowens added a commit to jbowens/cockroach that referenced this issue Sep 13, 2024
@Schtick Schtick added the O-postmortem Originated from a Postmortem action item. label Sep 13, 2024
@Schtick Schtick added the P-2 Issues/test failures with a fix SLA of 3 months label Sep 13, 2024
@exalate-issue-sync exalate-issue-sync bot added the T-storage Storage Team label Sep 13, 2024
@blathers-crl blathers-crl bot added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Sep 13, 2024
@RaduBerinde RaduBerinde self-assigned this Sep 13, 2024
@RaduBerinde
Member Author

This issue also tracks backing out the range key timestamp comparison behavior from master, which is more tricky than a simple revert. I am working on that but I will be out next week; will have a PR early after that.

jbowens added a commit to jbowens/cockroach that referenced this issue Sep 17, 2024
@nicktrav nicktrav moved this from Incoming to In Progress (this milestone) in [Deprecated] Storage Sep 17, 2024
craig bot pushed a commit that referenced this issue Sep 18, 2024
130453: logictest: revert incorrect test assertion update r=rafiss a=michae2

(Deja vu: this is #121556 all over again.)

103bd54 incorrectly updated the test expectations, likely because the `--rewrite` flag was used on an assertion that has the retry directive.

This commit undoes that change.

Fixes: #130405

Release note: None

130572: storage: GC range keys by unsetting identical suffixes r=jbowens a=jbowens

In CockroachDB's key encoding some keys have multiple logically equivalent but physically distinct encodings. Most notably, in CockroachDB versions 23.2 and earlier keys written to global tables encoded MVCC timestamps with a 'synthetic bit.' In #101938 CockroachDB stopped encoding and decoding this synthetic bit, transparently ignoring it.

In #129592 we observed the existence of a bug in the CockroachDB comparator when comparing two MVCC timestamp suffixes, specifically outside the context of a full MVCC key. The comparator failed to consider a timestamp with the synthetic bit and a timestamp without the synthetic bit as logically equivalent. There are limited instances where Pebble uses the comparator to compare "bare suffixes," and all instances are constrained to the implementation of range keys.

In #129592 it was observed that the comparator bug could prevent the garbage collection of MVCC delete range tombstones (the single use of range keys within CockroachDB). A cluster running 23.2 or earlier may write a MVCC delete range tombstone with a timestamp encoding the synthetic bit. If the cluster subsequently upgraded to 24.1 or later, the code path to clear range keys stopped understanding synthetic bits and wrote range key unset tombstones without the synthetic bit set. Due to the comparator bug, Pebble did not consider these timestamp suffixes equal and the unset was ineffective.

We initially attempted to fix this issue by fixing the comparator, but inadvertently introduced the possibility of replica divergence #130533 by changing the semantics of LSM state below raft.

This commit works around this comparator bug by adapting ClearMVCCRangeKey to write range key unsets using the verbatim suffix that was read from the storage engine. To avoid reverting #101938 and re-introducing knowledge of the synthetic bit, the MVCCRangeKey data structures are adapted to retain a copy of the encoded timestamp suffix when reading range keys from storage engine iterators. If later an attempt is made to clear the range key through ClearMVCCRangeKey, this encoded timestamp suffix is used instead of re-encoding the timestamp. Through avoiding the decoding/encoding roundtrip, ClearMVCCRangeKey ensures that the suffixes it writes are identical to the range keys that exist on disk, even if they encode a synthetic bit.

Release note (bug fix): Fixes a bug that could result in the inability to garbage collect a MVCC range tombstone within a global table.
Epic: none
Informs #129592.

130906: sql: deflake TestValidationWithProtectedTS r=rafiss a=rafiss

This test does not work if ranges get split, so we disable the split queue.

fixes #130715
Release note: None

Co-authored-by: Michael Erickson
Co-authored-by: Jackson Owens
Co-authored-by: Rafi Shamim
blathers-crl bot pushed a commit that referenced this issue Sep 18, 2024
blathers-crl bot pushed a commit that referenced this issue Sep 18, 2024
blathers-crl bot pushed a commit that referenced this issue Sep 18, 2024
jbowens added a commit that referenced this issue Sep 18, 2024
RaduBerinde added a commit to RaduBerinde/pebble that referenced this issue Sep 24, 2024
This change allows `CompareSuffixes` to be stricter than `Compare`
(when the prefixes are equal). This will allow reverting the CRDB
comparer behavior to be consistent with previous releases (avoiding
replica inconsistency).

Informs cockroachdb/cockroach#130533
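
One way to read "stricter" here is sketched below in Go. This is an interpretation under assumptions, not Pebble's API: the `key` type, the two functions, and the toy suffix layout (8 bytes plus an optional marker byte) are hypothetical, and the ordering direction is simplified. The sketch shows two full keys that the key comparison treats as equal while the suffix comparison still tells their suffixes apart.

```go
package main

import (
	"bytes"
	"fmt"
)

// key is a toy full key: a user-key prefix plus a timestamp suffix.
type key struct{ prefix, suffix []byte }

// strip drops the trailing marker byte of the toy encoding
// (8 timestamp bytes, optionally followed by one marker byte).
func strip(s []byte) []byte {
	if len(s) == 9 {
		return s[:8]
	}
	return s
}

// compare orders full keys and treats the marker as insignificant.
func compare(a, b key) int {
	if c := bytes.Compare(a.prefix, b.prefix); c != 0 {
		return c
	}
	return bytes.Compare(strip(a.suffix), strip(b.suffix))
}

// compareSuffixes orders bare suffixes byte-for-byte, so it can
// distinguish suffixes that compare treats as equal.
func compareSuffixes(a, b []byte) int { return bytes.Compare(a, b) }

func main() {
	plain := []byte{0, 0, 0, 0, 0, 0, 0, 42}
	legacy := append(append([]byte{}, plain...), 1)
	k1 := key{prefix: []byte("a"), suffix: plain}
	k2 := key{prefix: []byte("a"), suffix: legacy}
	fmt.Println("Compare:", compare(k1, k2))
	fmt.Println("CompareSuffixes:", compareSuffixes(plain, legacy))
}
```

In this toy encoding the suffix comparison only breaks ties: whenever `compare` orders two keys with equal prefixes, `compareSuffixes` orders their suffixes the same way.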
RaduBerinde added a commit to cockroachdb/pebble that referenced this issue Sep 24, 2024
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Sep 25, 2024
The comparer changes effectively revert cockroachdb#128043, which can cause
replica inconsistency during/after upgrades.

Changes:

 * [`01dcf575`](cockroachdb/pebble@01dcf575) base: make comparer tolerate empty keys
 * [`d73ab80f`](cockroachdb/pebble@d73ab80f) db: allow excises to unconditionally be flushable ingests
 * [`b34a3937`](cockroachdb/pebble@b34a3937) base: allow CompareSuffixes to be stricter than Compare
 * [`90356021`](cockroachdb/pebble@90356021) db: refactor replayWAL to use flushes to make versionEdits

Informs: cockroachdb#130533

Release note: none.
Epic: none.
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Sep 26, 2024
The comparer changes effectively revert cockroachdb#128043, which can cause
replica inconsistency during/after upgrades.

Changes:

 * [`0f785fec`](cockroachdb/pebble@0f785fec) metamorphic: abridge failure output
 * [`be56747f`](cockroachdb/pebble@be56747f) db: fix overlap check for flushable ingest excises
 * [`2569414a`](cockroachdb/pebble@2569414a) db: remove race in TestCrashOpenCrashAfterWALCreation
 * [`575f7a04`](cockroachdb/pebble@575f7a04) sstable: support columnar blocks in Layout.Describe
 * [`c88c7471`](cockroachdb/pebble@c88c7471) github: fix code cover publish workflow
 * [`c0fa4a9c`](cockroachdb/pebble@c0fa4a9c) sstable: populate CompareSuffixes on test4bSuffixComparer
 * [`3a76074f`](cockroachdb/pebble@3a76074f) sstable: set IndexPartitions property in columnar sstable writer
 * [`0595c1fb`](cockroachdb/pebble@0595c1fb) colblk: define behavior of KeyWriter.ComparePrev with no previous
 * [`01dcf575`](cockroachdb/pebble@01dcf575) base: make comparer tolerate empty keys
 * [`d73ab80f`](cockroachdb/pebble@d73ab80f) db: allow excises to unconditionally be flushable ingests
 * [`b34a3937`](cockroachdb/pebble@b34a3937) base: allow CompareSuffixes to be stricter than Compare
 * [`90356021`](cockroachdb/pebble@90356021) db: refactor replayWAL to use flushes to make versionEdits

Informs: cockroachdb#130533

Release note: none.
Epic: none.
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Sep 27, 2024
The comparer changes effectively revert cockroachdb#128043, which can cause
replica inconsistency during/after upgrades.

Changes:

 * [`0f785fec`](cockroachdb/pebble@0f785fec) metamorphic: abridge failure output
 * [`be56747f`](cockroachdb/pebble@be56747f) db: fix overlap check for flushable ingest excises
 * [`2569414a`](cockroachdb/pebble@2569414a) db: remove race in TestCrashOpenCrashAfterWALCreation
 * [`575f7a04`](cockroachdb/pebble@575f7a04) sstable: support columnar blocks in Layout.Describe
 * [`c88c7471`](cockroachdb/pebble@c88c7471) github: fix code cover publish workflow
 * [`c0fa4a9c`](cockroachdb/pebble@c0fa4a9c) sstable: populate CompareSuffixes on test4bSuffixComparer
 * [`3a76074f`](cockroachdb/pebble@3a76074f) sstable: set IndexPartitions property in columnar sstable writer
 * [`0595c1fb`](cockroachdb/pebble@0595c1fb) colblk: define behavior of KeyWriter.ComparePrev with no previous
 * [`01dcf575`](cockroachdb/pebble@01dcf575) base: make comparer tolerate empty keys
 * [`d73ab80f`](cockroachdb/pebble@d73ab80f) db: allow excises to unconditionally be flushable ingests
 * [`b34a3937`](cockroachdb/pebble@b34a3937) base: allow CompareSuffixes to be stricter than Compare
 * [`90356021`](cockroachdb/pebble@90356021) db: refactor replayWAL to use flushes to make versionEdits

Informs: cockroachdb#130533

Release note: none.
Epic: none.
craig bot pushed a commit that referenced this issue Sep 27, 2024
131366: go.mod: revert comparer change and bump Pebble to 0f785fec58c0 r=RaduBerinde a=RaduBerinde

The comparer changes effectively revert #128043, which can cause
replica inconsistency during/after upgrades.

Changes:

 * [`0f785fec`](cockroachdb/pebble@0f785fec) metamorphic: abridge failure output
 * [`be56747f`](cockroachdb/pebble@be56747f) db: fix overlap check for flushable ingest excises
 * [`2569414a`](cockroachdb/pebble@2569414a) db: remove race in TestCrashOpenCrashAfterWALCreation
 * [`575f7a04`](cockroachdb/pebble@575f7a04) sstable: support columnar blocks in Layout.Describe
 * [`c88c7471`](cockroachdb/pebble@c88c7471) github: fix code cover publish workflow
 * [`c0fa4a9c`](cockroachdb/pebble@c0fa4a9c) sstable: populate CompareSuffixes on test4bSuffixComparer
 * [`3a76074f`](cockroachdb/pebble@3a76074f) sstable: set IndexPartitions property in columnar sstable writer
 * [`0595c1fb`](cockroachdb/pebble@0595c1fb) colblk: define behavior of KeyWriter.ComparePrev with no previous
 * [`01dcf575`](cockroachdb/pebble@01dcf575) base: make comparer tolerate empty keys
 * [`d73ab80f`](cockroachdb/pebble@d73ab80f) db: allow excises to unconditionally be flushable ingests
 * [`b34a3937`](cockroachdb/pebble@b34a3937) base: allow CompareSuffixes to be stricter than Compare
 * [`90356021`](cockroachdb/pebble@90356021) db: refactor replayWAL to use flushes to make versionEdits

Informs: #130533

Release note: none.
Epic: none.


Co-authored-by: Radu Berinde <[email protected]>
cthumuluru-crdb pushed a commit to cthumuluru-crdb/cockroach that referenced this issue Oct 1, 2024
The comparer changes effectively revert cockroachdb#128043, which can cause
replica inconsistency during/after upgrades.

Changes:

 * [`0f785fec`](cockroachdb/pebble@0f785fec) metamorphic: abridge failure output
 * [`be56747f`](cockroachdb/pebble@be56747f) db: fix overlap check for flushable ingest excises
 * [`2569414a`](cockroachdb/pebble@2569414a) db: remove race in TestCrashOpenCrashAfterWALCreation
 * [`575f7a04`](cockroachdb/pebble@575f7a04) sstable: support columnar blocks in Layout.Describe
 * [`c88c7471`](cockroachdb/pebble@c88c7471) github: fix code cover publish workflow
 * [`c0fa4a9c`](cockroachdb/pebble@c0fa4a9c) sstable: populate CompareSuffixes on test4bSuffixComparer
 * [`3a76074f`](cockroachdb/pebble@3a76074f) sstable: set IndexPartitions property in columnar sstable writer
 * [`0595c1fb`](cockroachdb/pebble@0595c1fb) colblk: define behavior of KeyWriter.ComparePrev with no previous
 * [`01dcf575`](cockroachdb/pebble@01dcf575) base: make comparer tolerate empty keys
 * [`d73ab80f`](cockroachdb/pebble@d73ab80f) db: allow excises to unconditionally be flushable ingests
 * [`b34a3937`](cockroachdb/pebble@b34a3937) base: allow CompareSuffixes to be stricter than Compare
 * [`90356021`](cockroachdb/pebble@90356021) db: refactor replayWAL to use flushes to make versionEdits

Informs: cockroachdb#130533

Release note: none.
Epic: none.
@RaduBerinde
Member Author

#131366 reverted the behavior on master. #129620 tracks adding a migration to clean up the synthetic bits in range keys.
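
For readers following along, here is a minimal Go sketch of why a node that ignores the synthetic bit and a node that does not can disagree about whether a range key unset takes effect. It is an illustration under simplified assumptions, not the CockroachDB or Pebble implementation: the suffix layout (8 bytes of wall time plus an optional one-byte marker) and the names `encodeSuffix`, `looseCompareSuffixes`, `strictCompareSuffixes`, and `unsetMatches` are all hypothetical.

```go
package main

import (
	"bytes"
	"encoding/binary"
)

// Hypothetical suffix layout, for illustration only: 8 bytes of big-endian
// wall time, optionally followed by a single synthetic marker byte.
const wallLen = 8

// encodeSuffix builds a suffix in the hypothetical layout above.
func encodeSuffix(wall uint64, synthetic bool) []byte {
	buf := make([]byte, wallLen, wallLen+1)
	binary.BigEndian.PutUint64(buf, wall)
	if synthetic {
		buf = append(buf, 0x01)
	}
	return buf
}

// stripSynthetic drops the trailing marker byte, if present.
func stripSynthetic(suffix []byte) []byte {
	if len(suffix) == wallLen+1 {
		return suffix[:wallLen]
	}
	return suffix
}

// looseCompareSuffixes ignores the synthetic marker, so two suffixes that
// differ only in the marker compare as equal.
func looseCompareSuffixes(a, b []byte) int {
	return bytes.Compare(stripSynthetic(a), stripSynthetic(b))
}

// strictCompareSuffixes orders by timestamp first and then breaks ties on the
// raw bytes, so suffixes that differ only in the marker are NOT equal.
func strictCompareSuffixes(a, b []byte) int {
	if c := looseCompareSuffixes(a, b); c != 0 {
		return c
	}
	return bytes.Compare(a, b)
}

// unsetMatches models the rule that an unset only removes a range key whose
// suffix compares equal to the unset's suffix.
func unsetMatches(cmp func(a, b []byte) int, setSuffix, unsetSuffix []byte) bool {
	return cmp(setSuffix, unsetSuffix) == 0
}
```

In this model, a range key set with `encodeSuffix(42, true)` and an unset issued with `encodeSuffix(42, false)` match under the loose comparer but not under the strict one. Two replicas applying the same unset with different comparers would therefore end up with different state, which is the kind of divergence this issue describes.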

@github-project-automation github-project-automation bot moved this from In Progress (this milestone) to Done in [Deprecated] Storage Oct 1, 2024