Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvnemesis: committed reverse scan non-atomic timestamps [range key omission] #113973

Closed
erikgrinaker opened this issue Nov 7, 2023 · 14 comments · Fixed by #114189
Closed

kvnemesis: committed reverse scan non-atomic timestamps [range key omission] #113973

erikgrinaker opened this issue Nov 7, 2023 · 14 comments · Fixed by #114189
Assignees
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-storage Storage Team

Comments

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Nov 7, 2023

Seen while running kvnemesis on release-23.2 at c9ed4dc, with a kvnemesis patch that disables GuaranteedDurability, ForShare, Snapshot, and ReadCommitted (which have known problems):

committed reverse scan non-atomic timestamps: [rs(skip-locked)]/Table/100/"{309313ee013ee438"-e50cb930984ddea8"}:{0:[1699383532.424828351,0, 1699383532.557955116,0), 1:[1699383534.347766339,0, 1699383535.309620853,0), 2:[1699383542.710004787,0, 1699383543.351077787,0), 3:[1699383543.103706837,0, 1699383544.556847945,0), gap:[<min>, <max>)}->[/Table/100/"e25c418e7167a986":v5, /Table/100/"d95657ab20434c84":v31, /Table/100/"91d4841ef4345355":v68, /Table/100/"35865d504ff127b4":v73]

While running:

$ dev test pkg/kv/kvnemesis -f TestKVNemesisMultiNode -v --stress

Jira issue: CRDB-33279

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 labels Nov 7, 2023
@nvanbenschoten
Copy link
Member

In the failure, we see a reverse scan return two keys which should have been masked by two range tombstones. The range tombstone at the lower timestamp (1699383532.557955116,0) came from an AddSSTable (w2_step1) and the range tombstone tombstone at the higher timestamp (1699383535.309620853,0) came from a call to db.DelRangeUsingTombstone (w3_step16).

@nvanbenschoten
Copy link
Member

Given the involvement of range tombstones, it's possible that this is related to #112221.

@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Nov 8, 2023

Given that we've so far only seen this failure and the MVCC range key stats failure in #112221 (comment) on release-23.2, it might be an already-fixed bug that hasn't been backported yet. It did fail on master back in #111893 (comment), but not since.

There are some recent Pebble changes that could conceivably be related:

@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Nov 8, 2023

The master failure in #111893 (comment) was on 9ae8d5c, with Pebble at ede31f1a8e4b. Here's the surrounding Pebble upgrades on master:

8c0136f 2023-11-06: go.mod: bump Pebble to a0b01b62e8f9
e47c2c8 2023-11-02: go.mod: bump Pebble to 844f0582c2eb
c497daa 2023-11-01: go.mod: bump Pebble to df07a1e1c42c
4904314 2023-10-26: go.mod: bump Pebble to f522e2f9b146
66c3e60 2023-10-20: go.mod: bump Pebble to 593a72bf8976
d43029e 2023-10-12: go.mod: bump Pebble to ede31f1a8e4b

Here's the list of changes in Pebble master that are not backported to the crl-release-23.2 branch: cockroachdb/pebble@crl-release-23.2...master

Building release-23.2 with the latest Pebble master is complicated by Pebble using Go 1.21, while release-23.2 uses Go 1.20.

@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Nov 8, 2023

As @nvanbenschoten points out, the scan returns two versions that should have been masked by range keys. The scan returns the following versions, with their visible timestamp ranges in brackets:

ReverseScan: /Table/100/"{309313ee013ee438"-e50cb930984ddea8"}

"e25c418e7167a986":v5  @1699383532.424828351,0 [1699383532.424828351,0, 1699383532.557955116,0)
"d95657ab20434c84":v31 @1699383534.347766339,0 [1699383534.347766339,0, 1699383535.309620853,0)
"91d4841ef4345355":v68 @1699383542.710004787,0 [1699383542.710004787,0, 1699383543.351077787,0)
"35865d504ff127b4":v73 @1699383543.103706837,0 [1699383543.103706837,0, 1699383544.556847945,0)

This result is incorrect, because the scan must have happened at a timestamp above 1699383543.103706837,0 to see the latest version, and at that timestamp the versions "e25c418e7167a986":v5 and "d95657ab20434c84":v31 should have been deleted by these MVCC range tombstones:

"e25c418e7167a986":v5 @ 1699383532.424828351,0 deleted by:
  /Table/100/"e16fccf0a88752d5"-/Table/100/"e7a92ec2c3e4f3f7" 1699383535.309620853,0 -> /<empty>
  /Table/100/"e16fccf0a88752d5"-/Table/100/"e7a92ec2c3e4f3f7" 1699383532.893786042,0 -> /<empty>
  /Table/100/"e16fccf0a88752d5"-/Table/100/"e7a92ec2c3e4f3f7" 1699383532.557955116,0 -> /<empty>

"d95657ab20434c84":v31 @ 1699383534.347766339,0 deleted by:
  /Table/100/"d3dae807f6db750c"-/Table/100/"db865e18b1ed42af" 1699383535.309620853,0 -> /<empty>

The reverse scan code path unconditionally enables Pebble range key masking at the read timestamp, because we never emit tombstones via MVCCScanOptions.Tombstones:

opts := storage.MVCCScanOptions{
Inconsistent: h.ReadConsistency != kvpb.CONSISTENT,
SkipLocked: h.WaitPolicy == lock.WaitPolicy_SkipLocked,
Txn: h.Txn,
ScanStats: cArgs.ScanStats,
Uncertainty: cArgs.Uncertainty,
MaxKeys: h.MaxSpanRequestKeys,
MaxLockConflicts: storage.MaxConflictsPerLockConflictError.Get(&cArgs.EvalCtx.ClusterSettings().SV),
TargetBytes: h.TargetBytes,
AllowEmpty: h.AllowEmpty,
WholeRowsOfSize: h.WholeRowsOfSize,
FailOnMoreRecent: args.KeyLockingStrength != lock.None,
Reverse: true,
MemoryAccount: cArgs.EvalCtx.GetResponseMemoryAccount(),
LockTable: cArgs.Concurrency,
DontInterleaveIntents: cArgs.DontInterleaveIntents,
}

cockroach/pkg/storage/mvcc.go

Lines 4575 to 4581 in 3b6e770

iter, err := newMVCCIterator(
ctx, reader, timestamp, !opts.Tombstones, opts.DontInterleaveIntents, IterOptions{
KeyTypes: IterKeyTypePointsAndRanges,
LowerBound: key,
UpperBound: endKey,
},
)

cockroach/pkg/storage/mvcc.go

Lines 1282 to 1286 in 3b6e770

// Enable range key masking if requested.
if rangeKeyMasking && opts.KeyTypes != IterKeyTypePointsOnly &&
opts.RangeKeyMaskingBelow.IsEmpty() {
opts.RangeKeyMaskingBelow = timestamp
}

This leads me to believe this is a bug in Pebble, since these points should have been masked by the iterator. If the points were emitted, they should have been hidden by the pebbleMVCCScanner, which leads me to believe the range keys were not emitted either.

This is consistent with the MVCC stats failures, which always fail due to omitted range keys. I'll add that we never see replica inconsistencies, only MVCC stats discrepancies, so the bug must be deterministic across replicas (i.e. not likely a compaction problem).

I'm currently attempting to bisect across Pebble bumps on master.

@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Nov 8, 2023

Results:

9ae8d5cd    (Pebble at ede31f1a8e4b) failed: committed serializable txn non-atomic timestamps and RANGE_CONSISTENT_STATS_INCORRECT
66c3e60c416 (Pebble at 593a72bf8976) failed: committed scan non-atomic timestamps and RANGE_CONSISTENT_STATS_INCORRECT
4904314caa3 (Pebble at f522e2f9b146) passed: 800/800 runs (ignoring unrelated timeout failures)

Pebble changes in this interval:

$ git log 593a72bf8976..f522e2f9b146
 f522e2f9 2023-10-26: build: use go1.21 [Nick Travers]
 422dce91 2023-10-25: db: add Errorf to Logger interface [Jackson Owens]
 38191218 2023-10-25: db: retry manual compactions that get cancelled [Bilal Akhtar]
 ec8d19ff 2023-10-23: vfs/errorfs: add Randomly [Jackson Owens]
 babd592d 2023-10-16: db: defer loading L0 range key blocks during iterator construction [Jackson Owens]
 dda4d0d2 2023-10-17: vfs/errorfs: add OpFileReadAt predicate [Jackson Owens]

I guess cockroachdb/pebble@babd592d seems most likely, I'll try to cherry-pick it onto release-23.2.

@erikgrinaker erikgrinaker changed the title kvnemesis: committed reverse scan non-atomic timestamps [rs(skip-locked)] kvnemesis: committed reverse scan non-atomic timestamps [range key omission] Nov 8, 2023
@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Nov 8, 2023

release-23.2 with cockroachdb/pebble@babd592d cherry-pick passed 2000/2000 runs.

@jbowens We'll need to backport this. We'd also like to understand why this began failing, and why this change fixed it. Is it relevant for previous versions?

@erikgrinaker erikgrinaker added T-storage Storage Team and removed T-kv KV Team labels Nov 8, 2023
@blathers-crl blathers-crl bot added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Nov 8, 2023
@jbowens
Copy link
Collaborator

jbowens commented Nov 8, 2023

Thanks for investigating and bisecting @erikgrinaker. Looking now. I want to understand what's happening before we backport cockroachdb/pebble@babd592d, because that change should only change iterator behavior in the presence of I/O errors. It's possible cockroachdb/pebble@babd592d only began masking the issue.

It is relevant for previous versions as well.

@erikgrinaker
Copy link
Contributor Author

I suppose it's possible that there are metamorphic test parameters that provoke it too.

@erikgrinaker
Copy link
Contributor Author

We're thinking this may have something to do with kv.snapshot_receiver.excise.enabled being metamorphically enabled.

jbowens added a commit to jbowens/cockroach that referenced this issue Nov 8, 2023
The new excise operation and eventually file-only snapshot (EFOS) features in
Pebble are experimental. Within the 23.2 release, these features must only be
enabled within the context of the disaggregated storage techncial preview.
These features are still experimental and unstable.

Remove the metamorphism of these settings in tests in order to stabilize the
23.2 release branch. As we appraoch release, we want our test failures to be
high signal, and failures dependent on these settings should not block the
release. On the master branch these settings will be enabled unconditionally
soon, and we'll get plenty of test coverage before their general availability
in 24.1.

Epic: none
Informs cockroachdb#112221.
Informs cockroachdb#113973.
Informs cockroachdb#114056.
Release note: none
jbowens added a commit to jbowens/cockroach that referenced this issue Nov 8, 2023
The new excise operation and eventually file-only snapshot (EFOS) features in
Pebble are experimental. Within the 23.2 release, these features must only be
enabled within the context of the disaggregated storage techncial preview.
These features are still experimental and unstable.

Remove the metamorphism of these settings in tests in order to stabilize the
23.2 release branch. As we appraoch release, we want our test failures to be
high signal, and failures dependent on these settings should not block the
release. On the master branch these settings will be enabled unconditionally
soon, and we'll get plenty of test coverage before their general availability
in 24.1.

Epic: none
Informs cockroachdb#112221.
Informs cockroachdb#113973.
Informs cockroachdb#114056.
Release note: none
@erikgrinaker
Copy link
Contributor Author

Unsurprisingly, this fails on every Pebble bump after 65bb6cc but not before it. This commit added kv.snapshot_receiver.excise.enabled and enabled it metamorphically. All failures are with kv.snapshot_receiver.excise.enabled=true.

We're disabling this metamorphism on release-23.2 in #114057. Still seems worth getting to the bottom of this. Leaving this to storage.

@jbowens
Copy link
Collaborator

jbowens commented Nov 8, 2023

Nice, thanks @erikgrinaker. I'm going to remove the release-blocker label since it's only reproducible under these experimental cluster settings. We'll continue to investigate.

@jbowens jbowens removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Nov 8, 2023
@itsbilal itsbilal self-assigned this Nov 8, 2023
@itsbilal
Copy link
Member

itsbilal commented Nov 8, 2023

I dug into this with the assistance of cockroachdb/pebble#3044 . The underlying issue is that filteringIter in Pebble, which Pebble relies on heavily to truncate range deletions and range keys to make room for an excise, can break its own invariants when Seek methods are called. The bug is around here:

https://github.com/cockroachdb/pebble/blob/81d9a4fea01c84f2213cf68ccdc886dc2febc15f/internal/keyspan/filter.go#L88

Consider an sstable (00001) that contains a rangedel a-f (also works with range keys, but let's assume rangedels for now), and you excise b-d and slot in a point key at c with an IngestAndExcise(00002.sst, b-d). Your sstables will now look like this:

00003: virtual sstable pointing to 00001, bounds a-b, contains the a-f rangedel truncated to d-f
00002: ingested sstable, containing point key at c
00004: virtual sstable pointing to 00001, bounds d-f, contains the a-f rangedel truncated to d-f at read tie

Let's say 00003 and 00004 are a level above 00002. A levelIter produced out of those will see the rangedels a-b and d-f, so c should not be deleted.

However, if you did a levelIter.SeekLT(c) on that levelIter, it'll do a filteringIter.SeekLT(c) on 00004 (or will it? maybe we can strike the levelIter off of this explanation), which will find a-f, and filteringIter will truncate that to d-f and return it. Crucially, filteringIter will not realize that it returned a key strictly greater than the SeekLT seek key it was passed in, and callers will expect filteringIter obeyed its invariants when it didn't and all sorts of badness can happen.

The only fuzzy part of this explanation is I'm not sure if all callers already do additional sanity-checking on the return values of Seek calls to ensure that it actually obeyed the contract of a Seek, but in places where they don't (eg. getIter for sure, probably others), this can be a real bug

itsbilal added a commit to itsbilal/pebble that referenced this issue Nov 8, 2023
Previously if filteringIter's FilterFunc mutated the passed-in span
to no longer be a valid return value for a SeekLT or SeekGE call,
we would still return that span even though it could be >=  the seek
key (for SeekLT), or less than it (for SeekGE).

This change updates filteringIter to guard for this case before
returning from a seek call.

Found with cockroachdb#3044.

Informs cockroachdb/cockroach#113973, cockroachdb/cockroach#114056.
itsbilal added a commit to itsbilal/pebble that referenced this issue Nov 8, 2023
Previously if filteringIter's FilterFunc mutated the passed-in span
to no longer be a valid return value for a SeekLT or SeekGE call,
we would still return that span even though it could be >=  the seek
key (for SeekLT), or less than it (for SeekGE).

This change updates filteringIter to guard for this case before
returning from a seek call.

Found with cockroachdb#3044.

Informs cockroachdb/cockroach#113973, cockroachdb/cockroach#114056.
@itsbilal
Copy link
Member

itsbilal commented Nov 8, 2023

Fix for the above mentioned bug is in cockroachdb/pebble#3046 .

itsbilal added a commit to cockroachdb/pebble that referenced this issue Nov 9, 2023
Previously if filteringIter's FilterFunc mutated the passed-in span
to no longer be a valid return value for a SeekLT or SeekGE call,
we would still return that span even though it could be >=  the seek
key (for SeekLT), or less than it (for SeekGE).

This change updates filteringIter to guard for this case before
returning from a seek call.

Found with #3044.

Informs cockroachdb/cockroach#113973, cockroachdb/cockroach#114056.
itsbilal added a commit to itsbilal/pebble that referenced this issue Nov 9, 2023
Previously if filteringIter's FilterFunc mutated the passed-in span
to no longer be a valid return value for a SeekLT or SeekGE call,
we would still return that span even though it could be >=  the seek
key (for SeekLT), or less than it (for SeekGE).

This change updates filteringIter to guard for this case before
returning from a seek call.

Found with cockroachdb#3044.

Informs cockroachdb/cockroach#113973, cockroachdb/cockroach#114056.
itsbilal added a commit to itsbilal/pebble that referenced this issue Nov 9, 2023
Previously if filteringIter's FilterFunc mutated the passed-in span
to no longer be a valid return value for a SeekLT or SeekGE call,
we would still return that span even though it could be >=  the seek
key (for SeekLT), or less than it (for SeekGE).

This change updates filteringIter to guard for this case before
returning from a seek call.

Found with cockroachdb#3044.

Informs cockroachdb/cockroach#113973, cockroachdb/cockroach#114056.
itsbilal added a commit to cockroachdb/pebble that referenced this issue Nov 9, 2023
Previously if filteringIter's FilterFunc mutated the passed-in span
to no longer be a valid return value for a SeekLT or SeekGE call,
we would still return that span even though it could be >=  the seek
key (for SeekLT), or less than it (for SeekGE).

This change updates filteringIter to guard for this case before
returning from a seek call.

Found with #3044.

Informs cockroachdb/cockroach#113973, cockroachdb/cockroach#114056.
itsbilal added a commit to itsbilal/cockroach that referenced this issue Nov 9, 2023
```
c4f530dd internal/keyspan: obey Seek invariants in filteringIter
```

Fixes cockroachdb#113973, fixes cockroachdb#114056.

Release note: None.
Release justification: Low-risk bugfix for an experimental feature
in this release.

Epic: none
itsbilal added a commit to itsbilal/cockroach that referenced this issue Nov 9, 2023
```
158dfe17 internal/keyspan: fix defragmenting iterator error handling
ee417657 internal/keyspan: consolidate datadriven iterator ops
869004e3 internal/dsl: new package
c4f530dd internal/keyspan: obey Seek invariants in filteringIter
```

Fixes cockroachdb#113973, fixes cockroachdb#114056.

Release note: None.
Release justification: Low-risk bugfix for an experimental feature in this release,
as well as general error-handling bugfix.

Epic: none
craig bot pushed a commit that referenced this issue Nov 10, 2023
114189: go.mod: bump Pebble to 8574e51347f8 r=jbowens a=itsbilal

```
8574e513 internal/keyspan: obey Seek invariants in filteringIter
```

Fixes #113973, fixes #114056.

Release note: None.

Epic: none

Co-authored-by: Bilal Akhtar <[email protected]>
@craig craig bot closed this as completed in 6c31cba Nov 10, 2023
@jbowens jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-storage Storage Team
Projects
Archived in project
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants