
kvserver: refresh range cache on rangefeed barrier failure #119512

Merged


erikgrinaker
Contributor

@erikgrinaker erikgrinaker commented Feb 22, 2024

The DistSender does not refresh its range cache for unsplittable requests. This could cause a rangefeed transaction pusher barrier request to persistently fail following a range merge if the range cache thought the barrier spanned multiple ranges. This would only resolve once the range cache was refreshed by some other request, which might never happen. This in turn would cause the rangefeed's resolved timestamp to stall.

Resolves #119536.
Resolves #119333.
Epic: none
Release note (bug fix): fixed a bug where rangefeed resolved timestamps could get stuck, continually emitting the log message "pushing old intents failed: range barrier failed, range split", typically following a range merge.
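The fix follows the pattern shown in the review thread below (pkg/kv/kvserver/replica_rangefeed.go): when the barrier fails, the rangefeed txn pusher sends a plain Get on the span's start key, which the DistSender routes normally and which therefore refreshes the stale range cache entry, letting a later push attempt succeed. A minimal sketch of that pattern, using a hypothetical standalone helper rather than the PR's actual diff:

package kvserver

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// barrierWithCacheRefresh is a hypothetical helper illustrating the
// workaround. Barrier is an unsplittable request, and the DistSender
// does not refresh its range cache for unsplittable requests, so a
// stale cache entry that straddles a merged range boundary makes the
// barrier fail persistently.
func barrierWithCacheRefresh(ctx context.Context, db *kv.DB, span roachpb.Span) error {
	_, _, err := db.BarrierWithLAI(ctx, span.Key, span.EndKey)
	if err != nil {
		// A plain Get on the start key is routed normally and refreshes
		// the stale range cache entry, so the next attempt can succeed.
		// TODO(erikgrinaker): the DistSender should refresh its cache instead.
		if _, getErr := db.Get(ctx, span.Key); getErr != nil {
			return getErr
		}
	}
	return err
}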

@erikgrinaker erikgrinaker requested review from nvanbenschoten and a team February 22, 2024 12:28
@erikgrinaker erikgrinaker self-assigned this Feb 22, 2024

blathers-crl bot commented Feb 22, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?



@erikgrinaker erikgrinaker changed the title from "kvserver: refresh range cache on rangefeed barrier" to "kvserver: refresh range cache via Get on rangefeed barrier failure" Feb 22, 2024
@erikgrinaker erikgrinaker force-pushed the distsender-rangefeed-barrier-refresh branch from 0c7779e to 8eb5fc2 February 22, 2024 17:57
@erikgrinaker erikgrinaker changed the title from "kvserver: refresh range cache via Get on rangefeed barrier failure" to "kvserver: refresh range cache on rangefeed barrier failure" Feb 22, 2024
@erikgrinaker erikgrinaker added the backport-23.1.x, backport-23.2.x, and backport-22.2.x labels Feb 22, 2024
@erikgrinaker
Contributor Author

As far as I can tell, this should do the right thing. I'm going to add a regression test before marking as ready, but feel free to have an early look @nvanbenschoten.

@erikgrinaker
Contributor Author

Kicked off another 10 runs of c2c/tpcc/warehouses=1000/duration=60/cutover=30.

@erikgrinaker erikgrinaker force-pushed the distsender-rangefeed-barrier-refresh branch from 8eb5fc2 to 8caed65 February 22, 2024 19:10
@erikgrinaker erikgrinaker marked this pull request as ready for review February 22, 2024 19:12
@erikgrinaker erikgrinaker requested a review from a team February 22, 2024 19:12
@erikgrinaker erikgrinaker force-pushed the distsender-rangefeed-barrier-refresh branch from 8caed65 to 857229a February 22, 2024 19:21
Contributor Author

@erikgrinaker erikgrinaker left a comment


Regression test in place, should be good to go.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/replica_rangefeed.go line 213 at r2 (raw file):

		//
		// TODO(erikgrinaker): the DistSender should refresh its cache instead.
		if _, err := tp.r.store.db.Get(ctx, tp.span.Key); err != nil {

Are we cool with using a Get here? Should we consider something else like a LeaseInfo?

Only reason to do so would be to avoid skewing metrics or something.
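For reference, a LeaseInfo-based variant might look roughly like the sketch below. This is assumed API usage, not code from the PR; kv.SendWrapped, NonTransactionalSender, and kvpb.LeaseInfoRequest exist in CockroachDB, but whether LeaseInfo actually avoids skewing the relevant metrics is the open question here.

// Assumed usage (not from the PR): a LeaseInfo request addresses a
// single key and is routed through the DistSender like any point
// request, so it should refresh a stale range cache entry the same
// way a Get does, possibly without skewing user-facing read metrics.
if _, pErr := kv.SendWrapped(
	ctx, db.NonTransactionalSender(),
	&kvpb.LeaseInfoRequest{RequestHeader: kvpb.RequestHeader{Key: span.Key}},
); pErr != nil {
	return pErr.GoError()
}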

@erikgrinaker erikgrinaker force-pushed the distsender-rangefeed-barrier-refresh branch 2 times, most recently from 3d3fdd7 to bc60d81 February 22, 2024 20:01
@erikgrinaker
Contributor Author

Kicked off another 10 runs of c2c/tpcc/warehouses=1000/duration=60/cutover=30.

Passed, doing another 20 runs with the updated PR.

Member

@nvanbenschoten nvanbenschoten left a comment


:lgtm:

Reviewed 1 of 3 files at r1, 1 of 2 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/kv/kvserver/replica_rangefeed.go line 213 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Are we cool with using a Get here? Should we consider something else like a LeaseInfo?

Only reason to do so would be to avoid skewing metrics or something.

A Get request seems safer to me, at least in the backport.


pkg/kv/kvserver/replica_rangefeed_test.go line 1741 at r3 (raw file):

	_, _, err = db3.BarrierWithLAI(ctx, span.Key, span.EndKey)
	require.Error(t, err)
	require.IsType(t, &kvpb.RangeKeyMismatchError{}, err)

This feels potentially flaky, as we're asserting that the cache has not been updated. I guess we can wait and see, but consider at least leaving a note to a future debugger saying that if this line fails, it can be removed without undermining the purpose of the test.
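For example, such a note might read (hypothetical comment wording, extending the quoted test lines):

	// NB: this asserts that the range cache still contains the stale
	// pre-merge descriptors, i.e. that nothing else has refreshed it yet.
	// If this ever proves flaky, these assertions can be removed without
	// undermining the purpose of the test.
	_, _, err = db3.BarrierWithLAI(ctx, span.Key, span.EndKey)
	require.Error(t, err)
	require.IsType(t, &kvpb.RangeKeyMismatchError{}, err)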

@erikgrinaker
Contributor Author

The stress test is failing on some other rangefeed test, but I can't figure out why -- it doesn't really say anything in the failure log or the server logs:

=== RUN   TestNewRangefeedForceLeaseRetry
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestNewRangefeedForceLeaseRetry3968364281
    test_log_scope.go:81: use -show-logs to present logs inline
    replica_rangefeed_test.go:1638: kv/kvserver_test/replica_rangefeed_test.go:1551: test injected error
    panic.go:523: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestNewRangefeedForceLeaseRetry3968364281
--- FAIL: TestNewRangefeedForceLeaseRetry (0.71s)

I'm going to assume this is a flake and start a bors run while I dig into the log dump.

@erikgrinaker erikgrinaker force-pushed the distsender-rangefeed-barrier-refresh branch from bc60d81 to 0698937 February 22, 2024 21:08
@erikgrinaker
Contributor Author

bors r+


We've seen that flake in #92451 too.

@craig
Contributor

craig bot commented Feb 22, 2024

Build succeeded.

@craig craig bot merged commit 9d55e4b into cockroachdb:master Feb 22, 2024
12 of 15 checks passed

blathers-crl bot commented Feb 22, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 0698937 to blathers/backport-release-22.2-119512: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


error creating merge commit from 0698937 to blathers/backport-release-23.1-119512: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


