batcheval: ExportRequest does not return to the client when over the ElasticCPU limit if TargetBytes is unset #96684
Labels
A-disaster-recovery
C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.)
T-disaster-recovery
Comments
adityamaru added the C-bug and A-disaster-recovery labels on Feb 6, 2023
cc @cockroachdb/disaster-recovery
adityamaru changed the title from "batcheval: ExportRequest does not always return to the client when over the ElasticCPU limit" to "batcheval: ExportRequest does not return to the client when over the ElasticCPU limit if TargetBytes is unset" on Feb 6, 2023
adityamaru added a commit to adityamaru/cockroach that referenced this issue on Feb 9, 2023:

Previously, there was a strange coupling between the elastic CPU limiter and the `header.TargetBytes` DistSender limit set on each ExportRequest. Even if a request was preempted on exhausting its allotted CPU tokens, it would only return from the kvserver by virtue of its `header.TargetBytes` being set to a non-zero value. Of the four users of ExportRequest, only backup set this field, to a sentinel value of 1, to limit the number of SSTs sent back in an ExportResponse. The remaining callers of ExportRequest would not return from the kvserver; instead they would immediately re-evaluate the request from the resume key, not giving the scheduler a chance to take the goroutine off the CPU.

This change breaks that coupling by introducing a `resumeInfo` object that indicates whether the resumption was because we were over our CPU limit. If it was, we return an ExportResponse with our progress so far. This shifts the burden of handling pagination to the client, which seems better than having the server sleep or wait until its CPU tokens are replenished, as the client would otherwise be left wondering why a request is taking so long. To that effect, this change adds pagination support to the other callers of ExportRequest.

Note that we do not yet set `SplitMidKey` at these other callsites, so all pagination will happen at key boundaries in the ExportRequest. A follow-up will add `SplitMidKey` support to these callers.

Informs: cockroachdb#96684

Release note: None
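As a rough illustration of the client-side pagination the commit describes, here is a minimal Go sketch. It is not the actual CockroachDB code: the `ExportReq`/`ExportResp` types and the `exportClient` interface are hypothetical stand-ins for the real KV client plumbing; only the resume-key loop is the point.

```go
package exportpagination

import "bytes"

// ExportReq and ExportResp are simplified stand-ins for the real request and
// response protos.
type ExportReq struct {
	StartKey, EndKey []byte
}

type ExportResp struct {
	// ResumeKey is set when the server paginated, e.g. because the elastic
	// CPU limiter preempted the request before the span was exhausted.
	ResumeKey []byte
	Files     [][]byte
}

type exportClient interface {
	Send(ExportReq) (ExportResp, error)
}

// exportSpan keeps re-issuing the ExportRequest from the resume key until the
// server reports that the span is exhausted. The server returns promptly when
// it runs out of CPU tokens; the client absorbs the pagination.
func exportSpan(c exportClient, start, end []byte) ([][]byte, error) {
	var files [][]byte
	cur := start
	for len(cur) > 0 && bytes.Compare(cur, end) < 0 {
		resp, err := c.Send(ExportReq{StartKey: cur, EndKey: end})
		if err != nil {
			return nil, err
		}
		files = append(files, resp.Files...)
		// A nil resume key means the server finished the span; otherwise we
		// resume exactly where the previous request was cut off.
		cur = resp.ResumeKey
	}
	return files, nil
}
```

Until `SplitMidKey` is enabled for these callers, each resume key in this loop falls on a key boundary, per the commit message above.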
adityamaru added a commit to adityamaru/cockroach that referenced this issue on Feb 24, 2023
craig bot pushed a commit that referenced this issue on Feb 25, 2023:
96691: *: enables elastic CPU limiter for all users of ExportRequest r=stevendanna a=adityamaru

Co-authored-by: adityamaru <[email protected]>
Description

Recently, we added an elastic CPU limiter to `mvccExportToWriter`. The limiter checks whether an ExportRequest is over its allotted CPU tokens and, if it is, returns from the method with a resume key. This return from `mvccExportToWriter` is expected to result in an ExportResponse (along with the resume key) being returned to the client. Signaling the RPC to return gives the scheduler a chance to take the goroutine off the CPU, thereby allowing other processes waiting on the CPU to be admitted.

Today, this return to the client is conditional on breaking out of the loop in cockroach/pkg/kv/kvserver/batcheval/cmd_export.go (lines 172 to 307 in 260f4fa). The `break` in that loop, however, sits inside a conditional (line 268 in 260f4fa) that is gated on `TargetBytes`. ExportRequests are not always sent with `TargetBytes` set, so even when the elastic limiter indicates we are over the CPU limit, we do not break out of the loop and instead immediately retry the same export. This likely leads to excessive thrashing, since the scheduler is never given a chance to offload the goroutine before the retry. There is no correlation between `TargetBytes` being set and paginating because of a CPU limit, so we should not rely on the `break` inside that conditional.

Jira issue: CRDB-24277
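To make the coupling concrete, here is a heavily simplified Go sketch of the control flow described above. It is not the real `cmd_export.go`; the names (`evalExport`, `exportResult`) are hypothetical, but it shows how a resume key produced by the CPU limiter never makes it back to the client when `targetBytes` is zero.

```go
package sketch

// exportResult is a stand-in for the result of a single export attempt.
type exportResult struct {
	resumeKey []byte // set when the elastic CPU limiter preempted the export
	numBytes  int64
}

// evalExport mimics the evaluation loop: export from a start key, then decide
// whether to return to the client or retry from the resume key.
func evalExport(start []byte, targetBytes int64, export func(start []byte) exportResult) {
	var curBytes int64
	for {
		res := export(start)
		curBytes += res.numBytes

		// The only early return to the client is guarded by targetBytes, even
		// though res.resumeKey may be set purely because we ran out of CPU
		// tokens.
		if targetBytes > 0 && curBytes >= targetBytes {
			break // paginate: an ExportResponse with the resume key goes back
		}

		if res.resumeKey == nil {
			break // span exhausted: nothing left to export
		}

		// targetBytes == 0: retry immediately from the resume key without
		// ever yielding the goroutine back to the scheduler.
		start = res.resumeKey
	}
}
```

With the fix described in the commits above, the decision to return would instead be driven by the `resumeInfo` recorded when the request is preempted, independent of `targetBytes`.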