Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: correctly update the stats for rolled back deletes #113306

Merged
merged 1 commit into from
Nov 2, 2023

Conversation

miraradeva
Copy link
Contributor

Previously, a delete on a key rolled back by a savepoint did not correctly update the MVCCStats when the intent for that key was resolved. In particular, if the rollback revealed a non-deleted value for the key, the stats considered that key/value non-live and included other inaccuracies. This is an example from the output of kvnemesis, which caught the bug:

engine stats:
{ContainsEstimates:0 LastUpdateNanos:1698322873881949000 LockAge:0
GCBytesAge:0 LiveBytes:771 LiveCount:13 KeyBytes:1350 KeyCount:34
ValBytes:714 ValCount:53 IntentBytes:8 IntentCount:0 LockBytes:0
LockCount:0 RangeKeyCount:0 RangeKeyBytes:0 RangeValCount:0
RangeValBytes:0 SysBytes:254 SysCount:6 AbortSpanBytes:0}

computed stats:
{ContainsEstimates:0 LastUpdateNanos:1698322873881949000 LockAge:0
GCBytesAge:0 LiveBytes:161 LiveCount:1 KeyBytes:0 KeyCount:0
ValBytes:8 ValCount:0 IntentBytes:8 IntentCount:0 LockBytes:0
LockCount:0 RangeKeyCount:0 RangeKeyBytes:0 RangeValCount:0
RangeValBytes:0 SysBytes:0 SysCount:0 AbortSpanBytes:0}

This patch fixes the logic responsible for computing stats upon intent resolution and adds testing for rolled back deletes.

Informs: #97444

Release note (bug fix): Rolled back deletes no longer cause a discrepancy between computed stats and the actual stored values.

@miraradeva miraradeva requested a review from a team as a code owner October 30, 2023 18:31
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice find!

I'll add the backport labels for the releases that we'll want to backport to. I suspect that not all of them will be clean backports, unfortunately.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi and @miraradeva)


pkg/storage/mvcc.go line 630 at r1 (raw file):

// resolved value (key & bytes) are subtracted from the intents
// counters if commit=true.
func updateStatsOnResolve(

Can we add some coverage of this change in mvcc_stats_test.go? In particular, it would be nice to add:

  1. two unit tests for it (tombstone rollback to value + value rollback to tombstone), unless you think this is entirely redundant with pkg/storage/testdata/mvcc_histories/ignored_seq_nums_delete
  2. coverage of txn rollbacks in TestMVCCStatsRandomized

pkg/storage/mvcc.go line 713 at r1 (raw file):

	// If the value is deleted or a delete was rolled back, adjust LiveCount.
	if !meta.Deleted && orig.Deleted {

The way we usually accomplish this in these stats functions is logic that looks closer to:

// in the first half of the function
if orig.Deleted {
    ms.LiveCount++
}

// in the second half of the function
if meta.Deleted {
    ms.LiveCount--
}

That decouples the relationship between orig and meta slightly. Conveniently, these branches already exist.


pkg/storage/mvcc.go line 5090 at r1 (raw file):

		// here. Instead, this should all be handled down below.
		var removeIntent bool
		removeIntent, rolledBackVal, err = mvccMaybeRewriteIntentHistory(ctx, writer, intent.IgnoredSeqNums, newMeta, latestKey)

👍 it wasn't good that we were modifying meta. Thanks for fixing this. To confirm though, this wasn't needed to address the bug, right? (just double-checking my understanding).

@miraradeva miraradeva force-pushed the mira/incorrect-rollback-stats branch from 4175f47 to 4d482a3 Compare October 31, 2023 18:30
Copy link
Contributor Author

@miraradeva miraradeva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, thanks!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi and @nvanbenschoten)


pkg/storage/mvcc.go line 630 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Can we add some coverage of this change in mvcc_stats_test.go? In particular, it would be nice to add:

  1. two unit tests for it (tombstone rollback to value + value rollback to tombstone), unless you think this is entirely redundant with pkg/storage/testdata/mvcc_histories/ignored_seq_nums_delete
  2. coverage of txn rollbacks in TestMVCCStatsRandomized

1 is all set.
2 is a bit tricky because these randomized tests have no easy way of tweaking the probabilities of different operations. With the code I added I could not repro this bug because it needs a specific set of circumstances: a non-nil txn that commits and has an rolled back delete in its history. Out of 200 operations (the test default), for example, maybe 12 even get some interesting history, and the rest doesn't line up. Any suggestions?


pkg/storage/mvcc.go line 713 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The way we usually accomplish this in these stats functions is logic that looks closer to:

// in the first half of the function
if orig.Deleted {
    ms.LiveCount++
}

// in the second half of the function
if meta.Deleted {
    ms.LiveCount--
}

That decouples the relationship between orig and meta slightly. Conveniently, these branches already exist.

Nice, I like that.


pkg/storage/mvcc.go line 5090 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

👍 it wasn't good that we were modifying meta. Thanks for fixing this. To confirm though, this wasn't needed to address the bug, right? (just double-checking my understanding).

It was actually needed; I added a comment.

If we let mvccMaybeRewriteIntentHistory change meta, then the logic for updating the stats in updateStatsOnResolve doesn't quite work because meta doesn't correspond to the original stats (ms) anymore.

For example, suppose we have an intent and a rolled back deletion tombstone in the mvcc history, such that the tombstone is the most recent value. If we pass meta to mvccMaybeRewriteIntentHistory, it will remove the tombstone from the history but will not update ms to account for that. Then, we pass orig = meta to updateStatsOnResolve and orig.Deleted is false (referring to the intent), while we want it to be true (referring to the tombstone) to be able to adjust LiveBytes and LiveCount correctly.

@miraradeva miraradeva force-pushed the mira/incorrect-rollback-stats branch 2 times, most recently from e7b22cc to 6c5489f Compare October 31, 2023 20:23
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi and @miraradeva)


pkg/storage/mvcc.go line 630 at r1 (raw file):

Previously, miraradeva (Mira Radeva) wrote…

1 is all set.
2 is a bit tricky because these randomized tests have no easy way of tweaking the probabilities of different operations. With the code I added I could not repro this bug because it needs a specific set of circumstances: a non-nil txn that commits and has an rolled back delete in its history. Out of 200 operations (the test default), for example, maybe 12 even get some interesting history, and the rest doesn't line up. Any suggestions?

I don't think it needs to be able to reproduce the bug with a high probability, but making that probability non-zero would give us more confidence that we have coverage. That would ensure that stress over enough time would catch bugs in these cases.


pkg/storage/mvcc.go line 5090 at r1 (raw file):

Previously, miraradeva (Mira Radeva) wrote…

It was actually needed; I added a comment.

If we let mvccMaybeRewriteIntentHistory change meta, then the logic for updating the stats in updateStatsOnResolve doesn't quite work because meta doesn't correspond to the original stats (ms) anymore.

For example, suppose we have an intent and a rolled back deletion tombstone in the mvcc history, such that the tombstone is the most recent value. If we pass meta to mvccMaybeRewriteIntentHistory, it will remove the tombstone from the history but will not update ms to account for that. Then, we pass orig = meta to updateStatsOnResolve and orig.Deleted is false (referring to the intent), while we want it to be true (referring to the tombstone) to be able to adjust LiveBytes and LiveCount correctly.

Ah, that makes sense! Thanks for explaining.

@miraradeva miraradeva force-pushed the mira/incorrect-rollback-stats branch 2 times, most recently from d5814d6 to ffd0bac Compare November 2, 2023 14:48
Copy link
Contributor Author

@miraradeva miraradeva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi and @nvanbenschoten)


pkg/storage/mvcc.go line 630 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I don't think it needs to be able to reproduce the bug with a high probability, but making that probability non-zero would give us more confidence that we have coverage. That would ensure that stress over enough time would catch bugs in these cases.

There was a bug in the test. It was using the state's ts (s.TS) even when an operation was running transactionally. Those were failing silently because the op ts didn't match the txn's read ts. Now, the test repros the bug nicely.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi)

Previously, a delete on a key rolled back by a savepoint did not
correctly update the MVCCStats when the intent for that key was
resolved. In particular, if the rollback revealed a non-deleted value
for the key, the stats considered that key/value non-live and included
other inaccuracies. This is an example from the output of kvnemesis,
which caught the bug:

```
engine stats:
{ContainsEstimates:0 LastUpdateNanos:1698322873881949000 LockAge:0
GCBytesAge:0 LiveBytes:771 LiveCount:13 KeyBytes:1350 KeyCount:34
ValBytes:714 ValCount:53 IntentBytes:8 IntentCount:0 LockBytes:0
LockCount:0 RangeKeyCount:0 RangeKeyBytes:0 RangeValCount:0
RangeValBytes:0 SysBytes:254 SysCount:6 AbortSpanBytes:0}

computed stats:
{ContainsEstimates:0 LastUpdateNanos:1698322873881949000 LockAge:0
GCBytesAge:0 LiveBytes:161 LiveCount:1 KeyBytes:0 KeyCount:0
ValBytes:8 ValCount:0 IntentBytes:8 IntentCount:0 LockBytes:0
LockCount:0 RangeKeyCount:0 RangeKeyBytes:0 RangeValCount:0
RangeValBytes:0 SysBytes:0 SysCount:0 AbortSpanBytes:0}
```

This patch fixes the logic responsible for computing stats upon intent
resolution and adds testing for rolled back deletes.

Informs: cockroachdb#97444

Release note (bug fix): Rolled back deletes no longer cause a
discrepancy between computed stats and the actual stored values.
@miraradeva miraradeva force-pushed the mira/incorrect-rollback-stats branch from ffd0bac to 2ce4ebf Compare November 2, 2023 17:29
@nvanbenschoten nvanbenschoten added backport-22.2.x backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.2.x Flags PRs that need to be backported to 23.2. labels Nov 2, 2023
@miraradeva
Copy link
Contributor Author

bors r=nvanbenschoten

@craig
Copy link
Contributor

craig bot commented Nov 2, 2023

Build succeeded:

@craig craig bot merged commit a6f98c8 into cockroachdb:master Nov 2, 2023
3 checks passed
Copy link

blathers-crl bot commented Nov 2, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 2ce4ebf to blathers/backport-release-22.2-113306: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


error creating merge commit from 2ce4ebf to blathers/backport-release-23.1-113306: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


error creating merge commit from 2ce4ebf to blathers/backport-release-23.2-113306: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@nvanbenschoten
Copy link
Member

@miraradeva it looks like none of these backports were clean, so we'll need to manually backport them to each release branch. The instructions for that are here: https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/900005932/Backporting+a+change+to+a+release+branch

@miraradeva
Copy link
Contributor Author

@miraradeva it looks like none of these backports were clean, so we'll need to manually backport them to each release branch. The instructions for that are here: https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/900005932/Backporting+a+change+to+a+release+branch

Yup, I'm on it.

miraradeva added a commit to miraradeva/cockroach that referenced this pull request Apr 24, 2024
Recently, cockroachdb#113306 attempted to fix an issue in TestMVCCStatsRandomized
where a transactional request was silently failing because its
timestamp did not match the transaction's read timestamp. That patch
handled many of the types of requests in TestMVCCStatsRandomized but
not all (e.g. transactional GC requests).

This patch addresses the issue by refactoring the timestamp manipulation
to a single place in the test.

Fixes: cockroachdb#119409

Release note: None
craig bot pushed a commit that referenced this pull request Apr 24, 2024
122976: storage: fix TestMVCCStatsRandomized's handling of timestamps r=miraradeva a=miraradeva

Recently, #113306 attempted to fix an issue in TestMVCCStatsRandomized where a transactional request was silently failing because its timestamp did not match the transaction's read timestamp. That patch handled many of the types of requests in TestMVCCStatsRandomized but not all (e.g. transactional GC requests).

This patch addresses the issue by refactoring the timestamp manipulation to a single place in the test.

Fixes: #119409

Release note: None

Co-authored-by: Mira Radeva <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Apr 24, 2024
Recently, #113306 attempted to fix an issue in TestMVCCStatsRandomized
where a transactional request was silently failing because its
timestamp did not match the transaction's read timestamp. That patch
handled many of the types of requests in TestMVCCStatsRandomized but
not all (e.g. transactional GC requests).

This patch addresses the issue by refactoring the timestamp manipulation
to a single place in the test.

Fixes: #119409

Release note: None
miraradeva added a commit to miraradeva/cockroach that referenced this pull request Apr 24, 2024
Recently, cockroachdb#113306 attempted to fix an issue in TestMVCCStatsRandomized
where a transactional request was silently failing because its
timestamp did not match the transaction's read timestamp. That patch
handled many of the types of requests in TestMVCCStatsRandomized but
not all (e.g. transactional GC requests).

This patch addresses the issue by refactoring the timestamp manipulation
to a single place in the test.

Fixes: cockroachdb#119409

Release note: None
miraradeva added a commit to miraradeva/cockroach that referenced this pull request Apr 25, 2024
Recently, cockroachdb#113306 attempted to fix an issue in TestMVCCStatsRandomized
where a transactional request was silently failing because its
timestamp did not match the transaction's read timestamp. That patch
handled many of the types of requests in TestMVCCStatsRandomized but
not all (e.g. transactional GC requests).

This patch addresses the issue by refactoring the timestamp manipulation
to a single place in the test.

Fixes: cockroachdb#119409

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.2.x Flags PRs that need to be backported to 23.2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants