-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv: ranged intent resolution is not batched, regresses performance on TPC-C #46752
Comments
@nvanbenschoten are you planning on addressing this? |
Yes, I'm going to look into addressing it. I'm putting together a WIP fix now and I'd be interested in running it by you to see whether you like it. The high-level idea is to:
Does that line up with what you were thinking? I'm hoping to be able to verify that the fix does in fact resolve the regression in the next hour or so 🤞 |
I think the high-level pitch is reasonable. Can you help me understand the header bits a bit better. Does ResolveIntentRange use a transaction in a header? If so would we only be batching resolution for a single transaction? Most of the time we won't actually be paginating, we just need to support it for correctness, right? If a batch ends up needing to be send again to another range then we'll have to deal with that? How much of the pagination does the DistSender deal with today? |
No, it's just used to set a
Right, although there are cases that certainly will lead to pagination, like large DeleteRange requests.
This DistSender handles all of the pagination necessary for sending to ranges one-by-one when a limit is set. However, it only proceeds until |
Makes sense, thanks. Would it be better to put the loop to finish the pagination in the Would we want to set the value of |
Good news, I was just able to confirm that fixing this issue does address the regression in TPC-C. I had a small scare due to a goroutine leak that I introduced in my WIP patch, but with that fixed, things look a lot better.
I agree, I think it belongs in the
I think it can just be an option on the I'm hoping to clean up what I have and to put something out for this tomorrow. |
Fixes cockroachdb#46752. Resolves the recent perf regression on TPC-C. This commit follows in the footsteps of cockroachdb#34803 and introduces batching for ranged intent resolution, where previously only point intent resolution was batched. As we found in cockroachdb#46752, this is more important than it has been in the past, because implicit SELECT FOR UPDATE acquires a ranged lock on each row that it updates instead of a single-key lock. The change addresses this by adding a third request batcher to IntentResolver. ResolveIntent requests and ResolveIntentRange requests are batched separately, which is necessary for the use of MaxSpanRequestKeys to work properly (in fact, to be accepted by DistSender at all). To accommodate the ranged nature of ResolveIntentRange request, the notion of pagination is introduced into RequestBatcher. A MaxKeysPerBatchReq option is added to the configuration of a RequestBatcher, which corresponds to the MaxSpanRequestKeys value set on each BatchRequest header. The RequestBatcher is then taught about request pagination and how to work with partial responses. See the previous commit for clarification about the semantics at play here. Release justification: important fix to avoid a performance regression when implicit SELECT FOR UPDATE is enabled.
46952: kv: batch ranged intent resolution r=nvanbenschoten a=nvanbenschoten Fixes #46752. Resolves the recent perf regression on TPC-C. This commit follows in the footsteps of #34803 and introduces batching for ranged intent resolution, where previously only point intent resolution was batched. As we found in #46752, this is more important than it has been in the past, because implicit SELECT FOR UPDATE acquires a ranged lock on each row that it updates instead of a single-key lock. The change addresses this by adding a third request batcher to IntentResolver. ResolveIntent requests and ResolveIntentRange requests are batched separately, which is necessary for the use of MaxSpanRequestKeys to work properly (in fact, to be accepted by DistSender at all). To accommodate the ranged nature of ResolveIntentRange request, the notion of pagination is introduced into RequestBatcher. A MaxKeysPerBatchReq option is added to the configuration of a RequestBatcher, which corresponds to the MaxSpanRequestKeys value set on each BatchRequest header. The RequestBatcher is then taught about request pagination and how to work with partial responses. The semantics at play here are clarified in an earlier commit in the PR. Release justification: important fix to avoid a performance regression when implicit SELECT FOR UPDATE is enabled. All commits except the last are testing-only. The last commit is subtle but small and well-tested. @andreimatei: I assigned you because I think you know the most about `MaxSpanRequestKeys`. I'm mostly interested to get your input on the "rationalize Header.MaxSpanRequestKeys" commit (testing + comments only). Co-authored-by: Nathan VanBenschoten <[email protected]>
Fixes cockroachdb#46752. Resolves the recent perf regression on TPC-C. This commit follows in the footsteps of cockroachdb#34803 and introduces batching for ranged intent resolution, where previously only point intent resolution was batched. As we found in cockroachdb#46752, this is more important than it has been in the past, because implicit SELECT FOR UPDATE acquires a ranged lock on each row that it updates instead of a single-key lock. The change addresses this by adding a third request batcher to IntentResolver. ResolveIntent requests and ResolveIntentRange requests are batched separately, which is necessary for the use of MaxSpanRequestKeys to work properly (in fact, to be accepted by DistSender at all). To accommodate the ranged nature of ResolveIntentRange request, the notion of pagination is introduced into RequestBatcher. A MaxKeysPerBatchReq option is added to the configuration of a RequestBatcher, which corresponds to the MaxSpanRequestKeys value set on each BatchRequest header. The RequestBatcher is then taught about request pagination and how to work with partial responses. See the previous commit for clarification about the semantics at play here. Release justification: important fix to avoid a performance regression when implicit SELECT FOR UPDATE is enabled.
Over the past few weeks, we've been tracking a performance regression in TPC-C of 4-5% that occurs when implicit SELECT FOR UPDATE is enabled (see
enable_implicit_select_for_update
).One thing that became clear early in the investigation was that the total number of batches sent in the system rose by about 25% when running TPC-C with implicit SELECT FOR UPDATE enabled. This was surprising, as implicit SELECT FOR UPDATE is not supposed to cause the system to perform any extra work - we built it to only activate opportunistically when a scan was already going to be visiting a leaseholder to perform the row-fetch for an upcoming mutation. Using #46747, I was able to track down this increase in RPCs to a switch from
ResolveIntent
requests toResolveIntentRange
requests when implicit SELECT FOR UPDATE is in use. Remember that because SQL only currently issues rangedScan
requests and never pointGet
requests, implicit SFU is always acquiring ranged upgrade locks that overlap the point exclusive locks acquired by the following mutation, so we must release these withResolveIntentRange
requests.The following graph demonstrates this effect:
The graph shows two runs of TPC-C, first with
sql.defaults.implicit_select_for_update.enabled
set to true (the default) and second withsql.defaults.implicit_select_for_update.enabled
set to false. The top graph shows what we had already been seeing - with implicit SFU, we issue about 25% more batches (21k vs. 16k). The bottom graph is more interesting. It shows that with implicit SFU, we issue about 11kResolveIntent
requests and about 5kResolveIntentRange
requests. Without implicit SFU, we issue about 20kResolveIntent
requests and noResolveIntentRange
requests.So aren't we issuing more batches without implicit SFU? The answer is no, because
ResolveIntent
requests are batched together by theIntentResolver
. So even though we actually issue more requests, these requests are issued in batches of 100. However,ResolveIntentRange
requests are not batched together (see below), so each request is a separate batch. So these 5kResolveIntentRange
requests are exactly the missing 5k extra RPCs that we see with implicit SFU enabled.cockroach/pkg/kv/kvserver/intentresolver/intent_resolver.go
Lines 839 to 841 in 93cb2eb
This explains the modest regression in TPC-C performance and indicates that if we can lift this restriction on intent resolution batching, we should be able to close the gap.
The text was updated successfully, but these errors were encountered: