Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: GC range keys by unsetting identical suffixes #130572

Merged
merged 1 commit into from
Sep 18, 2024

Conversation

jbowens
Copy link
Collaborator

@jbowens jbowens commented Sep 12, 2024

In CockroachDB's key encoding some keys have multiple logically equivalent but physically distinct encodings. Most notably, in CockroachDB versions 23.2 and earlier keys written to global tables encoded MVCC timestamps with a 'synthetic bit.' In #101938 CockroachDB stopped encoding and decoding this synthetic bit, transparently ignoring it.

In #129592 we observed the existence of a bug in the CockroachDB comparator when comparing two MVCC timestamp suffixes, specifically outside the context of a full MVCC key. The comparator failed to consider a timestamp with the synthetic bit and a timestamp without the synthetic bit as logically equivalent. There are limited instances where Pebble uses the comparator to compare "bare suffixes," and all instances are constrained to the implementation of range keys.

In #129592 it was observed that the comparator bug could prevent the garbage collection of MVCC delete range tombstones (the single use of range keys within CockroachDB). A cluster running 23.2 or earlier may write a MVCC delete range tombstone with a timestamp encoding the synthetic bit. If the cluster subsequently upgraded to 24.1 or later, the code path to clear range keys stopped understanding synthetic bits and wrote range key unset tombstones without the synthetic bit set. Due to the comparator bug, Pebble did not consider these timestamp suffixes equal and the unset was ineffective.

We initially attempted to fix this issue by fixing the comparator, but inadvertently introduced the possibility of replica divergence #130533 by changing the semantics of LSM state below raft.

This commit works around this comparator bug by adapting ClearMVCCRangeKey to write range key unsets using the verbatim suffix that was read from the storage engine. To avoid reverting #101938 and re-introducing knowledge of the synthetic bit, the MVCCRangeKey data structures are adapted to retain a copy of the encoded timestamp suffix when reading range keys from storage engine iterators. If later an attempt is made to clear the range key through ClearMVCCRangeKey, this encoded timestamp suffix is used instead of re-encoding the timestamp. Through avoiding the decoding/encoding roundtrip, ClearMVCCRangeKey ensures that the suffixes it writes are identical to the range keys that exist on disk, even if they encode a synthetic bit.

Release note (bug fix): Fixes a bug that could result in the inability to garbage collect a MVCC range tombstone within a global table.
Epic: none
Informs #129592.

@jbowens jbowens requested a review from a team as a code owner September 12, 2024 15:51
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jbowens jbowens added backport-24.1.x Flags PRs that need to be backported to 24.1. backport-24.2.x Flags PRs that need to be backported to 24.2 labels Sep 12, 2024
@jbowens jbowens force-pushed the synth-unset branch 2 times, most recently from 590a08c to 2de835f Compare September 12, 2024 16:29
Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very nice and clean! I will take a second more detailed look later today.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @nvanbenschoten, and @sumeerbhola)

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

master still has the comparator fix, so I'd expect the range_keys_clear_synthetic_bit test to pass regardless of the change (can you verify that?) We should try the patch on release-24.2

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @nvanbenschoten, and @sumeerbhola)

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I confirmed that the tests pass on 24.2

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @nvanbenschoten, and @sumeerbhola)

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 9 of 9 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @jbowens, and @sumeerbhola)


-- commits line 40 at r1:
Nice writeup on the history of this issue!


pkg/storage/pebble_iterator.go line 750 at r1 (raw file):

			Timestamp:              timestamp,
			Value:                  rangeKey.Value,
			EncodedTimestampSuffix: rangeKey.Suffix,

Will this memory be invalidated when the iterator is advanced?

The reason I had suggested a copySuffix param in the doc was so that we could conditionally perform a memcpy here, but maybe that isn't needed.


pkg/storage/mvcc_history_test.go line 1939 at r1 (raw file):

	// range key writes will use and we want to exercise it. See #129592.
	if e.hasArg("syntheticBit") {
		suffix := storage.EncodeMVCCTimestampSuffixWithSyntheticBitForTesting(rangeKey.Timestamp)

nit: any reason this isn't inside the withWriter function with the EncodeMVCCValue call?


pkg/storage/mvcc_test.go line 5404 at r1 (raw file):

		} else {
			rw := mvccRangeKeyEncodedTimestampReadWriter{engine}
			// rw := engine

nit: stale comment?


pkg/storage/mvcc_test.go line 5416 at r1 (raw file):

// mvccRangeKeyEncodedTimestampReadWriter wraps a ReadWriter, overriding
// PutMVCCRangeKey so that if MVCCRangeKey.EncodedTimestampSuffix is non-empty,
// the resulting range key uses the encoded suffix varbatim. This is a bit of a

s/varbatim/verbatim/


pkg/storage/mvcc_test.go line 5423 at r1 (raw file):

// TODO(jackson): Remove this when we've guaranteed that all range keys have
// timestamps without the synthetic bit.
type mvccRangeKeyEncodedTimestampReadWriter struct {

Consider mentioning somewhere in this comment that the production ReadWriter implementations only use EncodedTimestampSuffix in their ClearMVCCRangeKey methods, not in their PutMVCCRangeKey methods.

@jbowens jbowens requested a review from a team as a code owner September 12, 2024 19:04
@jbowens jbowens requested a review from a team as a code owner September 12, 2024 19:45
@jbowens jbowens requested review from stevendanna and removed request for a team September 12, 2024 19:45
Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @nvanbenschoten, @stevendanna, and @sumeerbhola)


pkg/storage/pebble_iterator.go line 750 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Will this memory be invalidated when the iterator is advanced?

The reason I had suggested a copySuffix param in the doc was so that we could conditionally perform a memcpy here, but maybe that isn't needed.

It may (whether it was invalidated is dependent on the value of RangeKeyChanged() after the next iterator operation) , but that's true of the entire MVCCRangeKeyStack structure we return. The bounds user keys will be invalidated, and we'll reuse the p.mvccRangeKeyVersions slice to hold stack.Versions too. We rely on the caller Clone/CloneInto-ing if necessary


pkg/storage/mvcc_history_test.go line 1939 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: any reason this isn't inside the withWriter function with the EncodeMVCCValue call?

Done


pkg/storage/mvcc_test.go line 5423 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Consider mentioning somewhere in this comment that the production ReadWriter implementations only use EncodedTimestampSuffix in their ClearMVCCRangeKey methods, not in their PutMVCCRangeKey methods.

Done

@RaduBerinde
Copy link
Member

pkg/storage/pebble_batch.go line 762 at r2 (raw file):

			for _, v := range rangeKeys.Versions {
				if err := p.batch.RangeKeyUnset(startRaw, endRaw,
					EncodeMVCCTimestampSuffix(v.Timestamp), nil); err != nil {

I think we need it here too? Is there a way to exercise this code in a test?

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @nvanbenschoten, @RaduBerinde, @stevendanna, and @sumeerbhola)


pkg/storage/pebble_batch.go line 762 at r2 (raw file):

Previously, RaduBerinde wrote…

I think we need it here too? Is there a way to exercise this code in a test?

Really good catch—I added a mvcc history test that exercises this code path metamorphically. I started to look at trying to add a GC unit test that could've exercised this code, but it's a bit tricky to get the synthetic bit set on the test data.

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @jbowens, @nvanbenschoten, @stevendanna, and @sumeerbhola)


pkg/storage/pebble_batch.go line 762 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Really good catch—I added a mvcc history test that exercises this code path metamorphically. I started to look at trying to add a GC unit test that could've exercised this code, but it's a bit tricky to get the synthetic bit set on the test data.

Great! I audited all calls to EncodeMVCCTimestampSuffic and haven't found anything else.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 5 of 9 files at r1, 8 of 9 files at r2, 3 of 3 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @jbowens, @nvanbenschoten, and @stevendanna)


pkg/storage/pebble_batch.go line 222 at r3 (raw file):

	}
	return wb.PutEngineRangeKey(
		rangeKey.StartKey, rangeKey.EndKey, EncodeMVCCTimestampSuffix(rangeKey.Timestamp), value)

is this also a case where the encoded suffix, if present, could be used?
Was this deliberately omitted since it is used in AddSSTable and export, and we don't want to preserve the synthetic bit in those cases?

SSTWriter also has this method.


pkg/storage/sst_writer.go line 270 at r3 (raw file):

	// write the tombstone to clear it using the same encoding of the timestamp.
	// See #129592.
	if len(rangeKey.EncodedTimestampSuffix) > 0 {

just checking my understanding: is it accurate that we never actually unset rangekeys via writing to a sst?


pkg/kv/kvserver/batcheval/cmd_add_sstable_test.go line 1269 at r3 (raw file):

								v.LocalTimestamp.WallTime *= 1e9
								kv.RangeKey.Timestamp.WallTime *= 1e9
								kv.RangeKey.EncodedTimestampSuffix = storage.EncodeMVCCTimestampSuffix(kv.RangeKey.Timestamp)

since this is optional, should it randomly choose between setting and not setting, so we don't end up with code that relies on it being set?


pkg/kv/kvserver/batcheval/cmd_delete_range_test.go line 210 at r3 (raw file):

						EndKey:                 roachpb.Key(tc.end),
						Timestamp:              ts,
						EncodedTimestampSuffix: storage.EncodeMVCCTimestampSuffix(ts),

ditto


pkg/kv/kvserver/batcheval/knobs_use_range_tombstones_test.go line 88 at r3 (raw file):

		case storage.MVCCRangeKeyValue:
			kv.RangeKey.Timestamp = hlc.Timestamp{}
			kv.RangeKey.EncodedTimestampSuffix = nil

why explicitly setting it to nil?

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @nvanbenschoten, @stevendanna, and @sumeerbhola)


pkg/storage/pebble_batch.go line 222 at r3 (raw file):

Previously, sumeerbhola wrote…

is this also a case where the encoded suffix, if present, could be used?
Was this deliberately omitted since it is used in AddSSTable and export, and we don't want to preserve the synthetic bit in those cases?

SSTWriter also has this method.

Yeah, that's right. We deliberately omitted it so that we explicitly avoid propagating the synthetic bit


pkg/storage/sst_writer.go line 270 at r3 (raw file):

Previously, sumeerbhola wrote…

just checking my understanding: is it accurate that we never actually unset rangekeys via writing to a sst?

yeah, that's right


pkg/kv/kvserver/batcheval/cmd_add_sstable_test.go line 1269 at r3 (raw file):

Previously, sumeerbhola wrote…

since this is optional, should it randomly choose between setting and not setting, so we don't end up with code that relies on it being set?

unfortunately almost all the tests rely on deep equality of the range key, so we need to maintain the suffix on the expectation or update all the tests


pkg/kv/kvserver/batcheval/knobs_use_range_tombstones_test.go line 88 at r3 (raw file):

Previously, sumeerbhola wrote…

why explicitly setting it to nil?

To match the above explicit clearing of the Timestamp (this test does some wonky things)

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @jbowens, @nvanbenschoten, and @stevendanna)


pkg/storage/pebble_batch.go line 222 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Yeah, that's right. We deliberately omitted it so that we explicitly avoid propagating the synthetic bit

Could you add a code comment pointing this out, else someone might think it is an oversight.

In CockroachDB's key encoding some keys have multiple logically equivalent but
physically distinct encodings. Most notably, in CockroachDB versions 23.2 and
earlier keys written to global tables encoded MVCC timestamps with a
'synthetic bit.' In cockroachdb#101938 CockroachDB stopped encoding and decoding this
synthetic bit, transparently ignoring it.

In cockroachdb#129592 we observed the existence of a bug in the CockroachDB comparator
when comparing two MVCC timestamp suffixes, specifically outside the context of
a full MVCC key. The comparator failed to consider a timestamp with the
synthetic bit and a timestamp without the synthetic bit as logically
equivalent. There are limited instances where Pebble uses the comparator to
compare "bare suffixes," and all instances are constrained to the
implementation of range keys.

In cockroachdb#129592 it was observed that the comparator bug could prevent the garbage
collection of MVCC delete range tombstones (the single use of range keys within
CockroachDB). A cluster running 23.2 or earlier may write a MVCC delete range
tombstone with a timestamp encoding the synthetic bit. If the cluster
subsequently upgraded to 24.1 or later, the code path to clear range keys
stopped understanding synthetic bits and wrote range key unset tombstones
without the synthetic bit set. Due to the comparator bug, Pebble did not
consider these timestamp suffixes equal and the unset was ineffective.

We initially attempted to fix this issue by fixing the comparator, but
inadvertently introduced the possibility of replica divergence cockroachdb#130533 by
changing the semantics of LSM state below raft.

This commit works around this comparator bug by adapting ClearMVCCRangeKey to
write range key unsets using the verbatim suffix that was read from the storage
engine. To avoid reverting cockroachdb#101938 and re-introducing knowledge of the
synthetic bit, the MVCCRangeKey data structures are adapted to retain a copy of
the encoded timestamp suffix when reading range keys from storage engine
iterators. If later an attempt is made to clear the range key through
ClearMVCCRangeKey, this encoded timestamp suffix is used instead of re-encoding
the timestamp. Through avoiding the decoding/encoding roundtrip,
ClearMVCCRangeKey ensures that the suffixes it writes are identical to the
range keys that exist on disk, even if they encode a synthetic bit.

Release note (bug fix): Fixes a bug that could result in the inability to
garbage collect a MVCC range tombstone within a global table.
Epic: none
Informs cockroachdb#129592.
Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 5 of 9 files at r1, 8 of 9 files at r2, 2 of 3 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi, @jbowens, @nvanbenschoten, @stevendanna, and @sumeerbhola)

@jbowens
Copy link
Collaborator Author

jbowens commented Sep 18, 2024

TFTRs!

bors r+

@craig craig bot merged commit 68fc953 into cockroachdb:master Sep 18, 2024
23 checks passed
Copy link

blathers-crl bot commented Sep 18, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 75dcbec to blathers/backport-release-24.1-130572: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 24.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@jbowens jbowens deleted the synth-unset branch September 18, 2024 14:55
@rail
Copy link
Member

rail commented Sep 18, 2024

blathers backport staging-v24.2.2

@jbowens
Copy link
Collaborator Author

jbowens commented Sep 18, 2024

blathers backport release-24.2.3-rc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-24.1.x Flags PRs that need to be backported to 24.1. backport-24.2.x Flags PRs that need to be backported to 24.2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants