backupccl,kvserver: log failed ExportRequest trace on client and server #102793
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
@arulajmani not sure how much you like this approach, but as discussed offline, for remotely served ExportRequests DistSender does not wait for the batch response before returning an error to the client. For this reason, the trace emitted on the client will only contain calls up to

Let me know if you think it's useful to push forward with this diff. Some experimentation in roachprod shows that if we always send ExportRequests as verbose, then for requests that take longer than the timeout we see the following log line, which is definitely more than what we have today. A structured trace would give us less than what is pasted below, but would be a good starting point.

I want to leave the switch to verbose tracing spans to a follow-up that is done with some performance testing of different backups.
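The client-side behavior under discussion reduces to a timeout wrapped around the request, with whatever trace the client has collected logged when the deadline fires. A minimal standard-library sketch of that shape (the `sendExport` stand-in and the timeout value are illustrative, not the actual DistSender/ExportRequest code):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// sendExport stands in for issuing a (possibly remote) ExportRequest;
// here it simply blocks until the caller's context is done.
func sendExport(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	// A pre-determined per-request timeout, analogous to the backup's
	// ExportRequest timeout window.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	if err := sendExport(ctx); errors.Is(err, context.DeadlineExceeded) {
		// This is the point at which the PR logs the trace recording
		// collected so far on the client; a remotely served request may
		// still be evaluating on the server when this fires.
		fmt.Println("export request timed out; log the client-side trace recording")
	}
}
```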
Thanks for doing this, this is sick! I have no opinions on the approach to be honest.
I want to leave the switch to verbose tracing spans to a follow up that is done with some performance testing of different backups.
Consider putting it behind a cluster setting, defaulted to false. That way, we can reach for verbose tracing when running experiments or whatever else.
Reviewable status: complete! 0 of 0 LGTMs obtained
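For reference, the cluster setting suggested above ends up looking roughly like the registration quoted in the diff further down. The key name, class, and variable name come from that diff; the description string and default are assumptions filled in for illustration, so treat this as a sketch rather than the exact merged code:

```go
package backupccl

import "github.com/cockroachdb/cockroach/pkg/settings"

// Defaulted to false so verbose tracing of export requests is opt-in for
// experiments; it can be flipped at runtime without a restart.
var sendExportRequestWithVerboseTracing = settings.RegisterBoolSetting(
	settings.TenantWritable,
	"bulkio.backup.export_request_verbose_tracing",
	"send each export request in a verbose tracing span", // assumed description
	false,                                                 // assumed default
)
```

At request time the backup processor would consult it with something like `sendExportRequestWithVerboseTracing.Get(&st.SV)` before choosing the tracing mode for the outgoing ExportRequest.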
Force-pushed from 8ec9cf9 to 1538094.
I changed the error wrapping to a log line just because this error is stored in the jobs table and is surfaced on the console, so having a big trace dumped in the error doesn't look great. The only part I'm antsy about is the change to
#103034 will make the use of
LGTM, but you could just inline what you're doing in SendWrappedWithAdmissionTraced like the comment below suggests.
Force-pushed from 1538094 to 43ad7ef.
This change strives to improve observability around backups that fail because of timed out ExportRequests. Currently, there is very little indication of what the request was doing when the client cancelled the context after the pre-determined timeout window. With this change we now log the trace of the ExportRequest that failed. If the ExportRequest was served locally, then the trace will be part of the sender's tracing span. However, if the request was served on a remote node then the sender does not wait for the server side evaluation to notice the context cancellation. To work around this, we also print the trace on the server side if the request encountered a context cancellation and the associated tracing span is not a noop.

This change also adds a private cluster setting `bulkio.backup.export_request_verbose_tracing` that, if set to true, will send all backup export requests with verbose tracing mode.

To debug a backup failing with a timed out export request we can now:

- SET CLUSTER SETTING bulkio.backup.export_request_verbose_tracing = true;
- SET CLUSTER SETTING trace.snapshot.rate = '1m';

Once the backup times out we can look at the logs for the server side and client side ExportRequest traces to then understand where we were stuck executing for so long.

Fixes: cockroachdb#86047

Release note: None
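The server-side half described above amounts to checking, when evaluation returns, whether the failure was a cancellation or timeout and whether a non-empty recording exists, and logging it if so. A sketch under assumed names (the helper, its package, and its call site are illustrative, not the exact code added to the server):

```go
package sketch // illustrative placement; the real code lives in the KV server's batch path

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/log"
	"github.com/cockroachdb/cockroach/pkg/util/tracing/tracingpb"
	"github.com/cockroachdb/errors"
)

// maybeLogTraceOnCancellation is an illustrative helper: if the request
// failed because the client's context was canceled or timed out, and the
// request's span actually recorded something (i.e. it was not a noop span),
// dump the recording to the server's log.
func maybeLogTraceOnCancellation(
	ctx context.Context, recording tracingpb.Recording, err error,
) {
	if len(recording) == 0 {
		return // noop span, nothing worth printing
	}
	if !errors.Is(err, context.Canceled) && !errors.Is(err, context.DeadlineExceeded) {
		return
	}
	log.Errorf(ctx, "request failed with %v; trace:\n%s", err, recording)
}
```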
Force-pushed from 43ad7ef to d59a111.
Flake is #99261

bors r=irfansharif

Build failed (retrying...):

Build succeeded:
@@ -453,6 +471,9 @@ func runBackupProcessor(
		// TimeoutError improves the opaque `context deadline exceeded` error
		// message so use that instead.
		if errors.HasType(exportRequestErr, (*contextutil.TimeoutError)(nil)) {
			if recording != nil {
				log.Errorf(ctx, "failed export request for span %s\n trace:\n%s", span.span, recording)
here and below: unnecessary space before trace.
sendExportRequestWithVerboseTracing = settings.RegisterBoolSetting(
	settings.TenantWritable,
	"bulkio.backup.export_request_verbose_tracing",
How come this name was validated by the test TestLintClusterSettingNames in pkg/sql/show_test.go? It's not valid per that linter. Ditto the _delay one above.
blathers backport 23.1 22.2
Encountered an error creating backports. Some common things that can go wrong:

You might need to create your backport manually using the backport tool.

error creating merge commit from d59a111 to blathers/backport-release-22.2-102793: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []; you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2 failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
In #102793 we added server side logging for batch requests that would encounter a context cancellation or timeout. This was primarily motivated by the need to understand where export requests sent during a backup were spending their time and why they were timing out. This unintentionally added log spam for other kinds of internal requests such as HeartbeatTxn, QueryTxn, and EndTxn requests. This change limits this log line to only be printed when the corresponding request is an export request.

Fixes: None

Release note: None
115098: server: reduce log spam on canceled batch requests r=erikgrinaker a=adityamaru

In #102793 we added server side logging for batch requests that would encounter a context cancellation or timeout. This was primarily motivated by the need to understand where export requests sent during a backup were spending their time and why they were timing out. This unintentionally added log spam for other kinds of internal requests such as HeartbeatTxn, QueryTxn, and EndTxn requests. This change limits this log line to only be printed when the corresponding request is an export request.

Fixes: None
Epic: none
Release note: None

Co-authored-by: adityamaru <[email protected]>
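The gating described here boils down to inspecting the batch before emitting the trace log, so that only batches carrying an ExportRequest produce it. A sketch of that check (the helper name and package are illustrative, not the exact code in #115098):

```go
package sketch // illustrative placement; the real change lives in the server's batch-handling path

import "github.com/cockroachdb/cockroach/pkg/kv/kvpb"

// batchContainsExportRequest reports whether any request in the batch is an
// ExportRequest. Gating the cancellation trace log on this keeps HeartbeatTxn,
// QueryTxn, EndTxn, and other internal requests from spamming the logs.
func batchContainsExportRequest(ba *kvpb.BatchRequest) bool {
	for _, ru := range ba.Requests {
		if _, ok := ru.GetInner().(*kvpb.ExportRequest); ok {
			return true
		}
	}
	return false
}
```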