kvcoord: fix error index in case of range split #113111

Merged 1 commit on Nov 1, 2023

Conversation

@yuzefovich (Member) commented Oct 26, 2023

This commit fixes the DistSender to set the correct error index when an error is encountered after divideAndSendBatchToRanges was called recursively due to a stale range cache. In this scenario, we previously set the error index as if the original batch were the one right before the recursive call, which is incorrect if that batch itself had been split (e.g. because the original batch touched multiple ranges and each sub-batch was executed in parallel). We already had error-index mapping code in place for the main code path, but we forgot to re-map the error index after two recursive calls; this is now fixed.

This commit additionally pulls the logic to set response.positions out of sendPartialBatch, since there are fewer places where it needs to be done one level up.
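
As a minimal standalone illustration of the re-mapping (simplified, hypothetical types; not the actual DistSender code), assuming each sub-batch records where its requests sit in the parent batch:

package main

import "fmt"

// subBatch is a hypothetical stand-in for a split-off piece of a batch.
type subBatch struct {
	// positions[i] is the index in the parent batch of this sub-batch's
	// i-th request.
	positions []int
}

// remapErrIndex translates an error index local to the sub-batch back into
// the parent batch's coordinates. Negative or out-of-range indexes are
// treated as "unknown".
func remapErrIndex(sb subBatch, errIdx int) int {
	if errIdx < 0 || errIdx >= len(sb.positions) {
		return -1
	}
	return sb.positions[errIdx]
}

func main() {
	// Requests 1, 3, and 4 of the parent batch landed in this sub-batch.
	sb := subBatch{positions: []int{1, 3, 4}}
	// An error at local index 1 corresponds to parent position 3.
	fmt.Println(remapErrIndex(sb, 1)) // prints 3
}

Applying this translation at every level of recursion keeps the reported index in the coordinates of the batch the client originally sent.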

This bug has been present for a very long time, but it seems relatively minor, so I decided not to include a release note.

Fixes: #111481.

Release note: None


@yuzefovich yuzefovich changed the title from "WIP on a regression test" to "kvcoord: fix error index in case of range split" Oct 31, 2023
@yuzefovich yuzefovich added the backport-22.2.x, backport-23.1.x, and backport-23.2.x labels Oct 31, 2023
@yuzefovich yuzefovich marked this pull request as ready for review October 31, 2023 01:41
@yuzefovich yuzefovich requested a review from a team as a code owner October 31, 2023 01:41
@yuzefovich (Member, Author) commented Oct 31, 2023

I briefly tried the suggestion to do it in the same place where we call br.Combine, but that didn't work. I tried the following diff:

--- a/pkg/kv/kvclient/kvcoord/dist_sender.go
+++ b/pkg/kv/kvclient/kvcoord/dist_sender.go
@@ -1542,6 +1542,9 @@ func (ds *DistSender) divideAndSendBatchToRanges(
                for _, responseCh := range responseChs {
                        resp := <-responseCh
                        if resp.pErr != nil {
+                               if resp.pErr.Index != nil && resp.pErr.Index.Index != -1 && resp.positions != nil {
+                                       resp.pErr.Index.Index = int32(resp.positions[resp.pErr.Index.Index])
+                               }
                                if pErr == nil {
                                        pErr = resp.pErr
                                        // Update the error's transaction with any new information from

To me the current fix seems less risky, so I chose not to investigate this further.

@yuzefovich yuzefovich force-pushed the fix-error-index branch 2 times, most recently from a5fad80 to 79fc14c October 31, 2023 02:34
@yuzefovich (Member, Author) commented Oct 31, 2023

> I briefly tried the suggestion to do it in the same place where we call br.Combine, but that didn't work.

That diff was missing setting resp.positions in one place; once added, things seem to work. Let me know which approach is preferable.

Also, the newly added test has a race on updating the RangeCache.db field. I don't think we want to add synchronization in the production path just for the test; perhaps we could hide it behind buildutil.CrdbTestBuild? Any other ideas for getting around that? We could also just skip the test under race.
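
For the buildutil.CrdbTestBuild idea, a minimal standalone sketch (stand-in names and types; not the actual RangeCache code) of taking the lock only in test builds so the production path is unaffected:

package main

import "sync"

// crdbTestBuild stands in for buildutil.CrdbTestBuild, which is true only
// under the crdb_test build tag.
const crdbTestBuild = true

// cache is a stand-in for RangeCache with a swappable backing db.
type cache struct {
	mu sync.Mutex
	db string // stand-in for the RangeCache.db field
}

// getDB locks only in test builds; production reads stay lock-free because
// nothing swaps the field outside of tests.
func (c *cache) getDB() string {
	if crdbTestBuild {
		c.mu.Lock()
		defer c.mu.Unlock()
	}
	return c.db
}

// setDBForTesting swaps the backing db under the mutex; tests only.
func (c *cache) setDBForTesting(db string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.db = db
}

func main() {
	c := &cache{db: "real"}
	c.setDBForTesting("stub")
	_ = c.getDB()
}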

@nvanbenschoten (Member) left a comment

> Let me know which approach is preferable.

I like the approach that you have here.

If you wanted to take it one step further, you could remove the positions param from sendPartialBatch entirely and have the callers assign it to the response before pushing on the channel. That way, it couldn't be missed.
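
A rough standalone sketch of that shape (simplified, hypothetical types; not the actual kvcoord code):

package main

// response is a hypothetical stand-in for the per-sub-batch result.
type response struct {
	positions []int
}

// sendPartialBatch no longer takes a positions parameter and never touches
// the field itself.
func sendPartialBatch() response {
	return response{}
}

func main() {
	responseCh := make(chan response, 1)
	positions := []int{1, 3, 4}
	go func() {
		resp := sendPartialBatch()
		// The caller stamps positions right before pushing on the channel,
		// so the assignment cannot be missed.
		resp.positions = positions
		responseCh <- resp
	}()
	<-responseCh
}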

> Also, the newly added test has a race on updating the RangeCache.db field. I don't think we want to add synchronization in the production path just for the test; perhaps we could hide it behind buildutil.CrdbTestBuild? Any other ideas for getting around that? We could also just skip the test under race.

Did you consider using the waitThenSwitchToSplitDesc strategy that TestDescriptorChangeAfterRequestSubdivision uses? I believe that avoids the need to synchronize on the RangeCache.db field itself.
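
Roughly, the shape of that strategy as a standalone sketch (hypothetical names; see TestDescriptorChangeAfterRequestSubdivision for the real version): the test's descriptor lookup consults an atomic flag rather than mutating a field on the cache, so nothing synchronizes on RangeCache.db.

package main

import "sync/atomic"

// rangeDesc is a hypothetical stand-in for a range descriptor.
type rangeDesc struct{ start, end string }

var switchToSplitDesc atomic.Bool

// lookupDesc stands in for the test RangeDescriptorDB: it serves the
// pre-split descriptor until the test flips the flag, then the post-split
// ones.
func lookupDesc(key string) []rangeDesc {
	if switchToSplitDesc.Load() {
		return []rangeDesc{{"a", "m"}, {"m", "z"}}
	}
	return []rangeDesc{{"a", "z"}}
}

func main() {
	_ = lookupDesc("c") // one range before the switch
	switchToSplitDesc.Store(true)
	_ = lookupDesc("c") // two ranges after
}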

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@yuzefovich (Member, Author) left a comment

> If you wanted to take it one step further, you could remove the positions param from sendPartialBatch entirely and have the callers assign it to the response before pushing on the channel. That way, it couldn't be missed.

Nice, I like it, done.

> Did you consider using the waitThenSwitchToSplitDesc strategy that TestDescriptorChangeAfterRequestSubdivision uses? I believe that avoids the need to synchronize on the RangeCache.db field itself.

Oh, indeed, that works, thanks.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@nvanbenschoten (Member) left a comment

:lgtm:

Reviewed 2 of 2 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @yuzefovich)

@yuzefovich (Member, Author) commented Nov 1, 2023

Thanks a lot for the pointers and the review!

bors r+

@craig bot (Contributor) commented Nov 1, 2023

Build succeeded.

@craig craig bot merged commit f515f70 into cockroachdb:master Nov 1, 2023
3 checks passed
@blathers-crl bot commented Nov 1, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 5f8e863 to blathers/backport-release-22.2-113111: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@yuzefovich yuzefovich deleted the fix-error-index branch November 1, 2023 21:21
@yuzefovich yuzefovich removed the backport-22.2.x and backport-23.1.x labels Nov 11, 2023
Labels: backport-23.2.x
3 participants