
backupccl,kvserver: log failed ExportRequest trace on client and server #102793

Merged: 1 commit merged into cockroachdb:master from print-export-request-trace on May 15, 2023

Conversation

@adityamaru (Contributor) commented May 5, 2023

This change strives to improve observability around
backups that fail because of timed-out ExportRequests.
Currently, there is very little indication of what the request
was doing when the client cancelled the context after
the predetermined timeout window. With this change we
now log the trace of the ExportRequest that failed. If
the ExportRequest was served locally, the trace will be
part of the sender's tracing span. However, if the request
was served on a remote node, the sender does not wait
for the server-side evaluation to notice the context cancellation.
To work around this, we also print the trace on the server side
if the request encountered a context cancellation and the
associated tracing span is not a noop.

This change also adds a private cluster setting,
`bulkio.backup.export_request_verbose_tracing`, which, if set
to true, sends all backup ExportRequests with verbose tracing
enabled.

To debug a backup failing with a timed-out ExportRequest we
can now:

  • SET CLUSTER SETTING bulkio.backup.export_request_verbose_tracing = true;
  • SET CLUSTER SETTING trace.snapshot.rate = '1m';

Once the backup times out, we can look at the logs
for the server-side and client-side ExportRequest traces
to understand where execution was stuck for so long.
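
As an illustration, here is a minimal sketch of the server-side check described above. The type and method names are stand-ins, not the exact code in this change:

```go
package sketch

import (
	"context"
	"errors"
	"fmt"
)

// traceSpan is a simplified stand-in for a tracing span; the real type lives
// in pkg/util/tracing and its method names may differ.
type traceSpan interface {
	IsNoop() bool            // true if the span records nothing
	RecordingString() string // the collected trace rendered as text
}

// maybeLogCanceledTrace mirrors the idea above: if a request failed because
// its context was cancelled and it carried a real (non-noop) tracing span,
// dump the span's recording to the server log so the evaluation history of
// the timed-out ExportRequest is not lost.
func maybeLogCanceledTrace(ctx context.Context, sp traceSpan, evalErr error) {
	if evalErr == nil || !errors.Is(evalErr, context.Canceled) {
		return // only interested in context cancellations
	}
	if sp == nil || sp.IsNoop() {
		return // nothing was recorded, so there is nothing worth logging
	}
	// The real change logs through CockroachDB's logging package; fmt is used
	// here only to keep the sketch self-contained.
	fmt.Printf("request failed with %v\ntrace:\n%s\n", evalErr, sp.RecordingString())
}
```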

Fixes: #86047
Release note: None

blathers-crl bot commented May 5, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?



@adityamaru (Contributor, Author) commented May 5, 2023

@arulajmani I'm not sure how much you like this approach, but as discussed offline, for remotely served ExportRequests DistSender does not wait for the batch response before returning an error to the client. For this reason, the trace emitted on the client will only contain calls up to DistSender.Send. To work around this I've added a server-side log line in case the BatchRequest encounters a context cancellation. This way, if an ExportRequest were to time out, we'd at least have the trace for that request in the logs to understand why it timed out.

Let me know if you think it's useful to push forward with this diff. Some experimentation in roachprod shows that if we always send ExportRequests as verbose, then for requests that take longer than the timeout we see the following log line, which is definitely more than what we have today. A structured trace would give us less than what is pasted below, but it would be a good starting point. I want to leave the switch to verbose tracing spans to a follow-up that is done with some performance testing of different backups.

adityamaru-test-0003> W230504 22:13:50.766535 4269 server/node.go:1227 ⋮ [T1,n3] 237 request: ‹1 Export›
error: ‹context canceled›
trace: ‹     0.000ms      0.000ms    === operation:/cockroach.roachpb.Internal/Batch _unfinished:1 _verbose:1 node:3 span.kind:server request:Export [/Table/104/1,/Table/104/2), [wait-policy: Error]›
‹     0.000ms      0.000ms    [evalExport: {count: 1, duration 40s}]›
‹     0.000ms      0.000ms    [storage.MVCCExportToSST: {count: 1, duration 40s}]›
‹     0.039ms      0.039ms    event:server/node.go:1171 [n3] node received request: 1 Export›
‹     0.122ms      0.083ms    event:kv/kvserver/store_send.go:154 [n3,s3] executing Export [/Table/104/1,/Table/104/2), [wait-policy: Error]›
‹     0.159ms      0.037ms    event:kv/kvserver/replica_send.go:180 [n3,s3,r64/3:/{Table/104-Max}] read-only path›
‹     0.184ms      0.025ms    event:kv/kvserver/concurrency/concurrency_manager.go:195 [n3,s3,r64/3:/{Table/104-Max}] sequencing request›
‹     0.198ms      0.014ms    event:kv/kvserver/concurrency/concurrency_manager.go:276 [n3,s3,r64/3:/{Table/104-Max}] acquiring latches›
‹     0.217ms      0.020ms    event:kv/kvserver/concurrency/concurrency_manager.go:320 [n3,s3,r64/3:/{Table/104-Max}] scanning lock table for conflicting locks›
‹     0.257ms      0.039ms    event:kv/kvserver/replica_follower_read.go:126 [n3,s3,r64/3:/{Table/104-Max}] serving via follower read; query timestamp below closed timestamp by 1.71932036s›
‹     0.333ms      0.076ms    event:kv/kvserver/replica_read.go:297 [n3,s3,r64/3:/{Table/104-Max}] can drop latches early for batch (Export [/Table/104/1,/Table/104/2), [wait-policy: Error]); scanning lock table first to detect conflicts›
‹     0.399ms      0.066ms    event:kv/kvserver/replica_read.go:114 [n3,s3,r64/3:/{Table/104-Max}] lock table scan complete without conflicts; dropping latches early›
‹     0.415ms      0.016ms    event:kv/kvserver/replica_read.go:409 [n3,s3,r64/3:/{Table/104-Max}] executing read-only batch›
‹     0.422ms      0.007ms        === operation:evalExport _verbose:1 node:3 store:3 range:64/3:/{Table/104-Max}›
‹     0.422ms      0.000ms        [storage.MVCCExportToSST: {count: 1, duration 40s}]›
‹     0.427ms      0.005ms            === operation:storage.MVCCExportToSST _verbose:1 node:3 store:3 range:64/3:/{Table/104-Max}›
‹ 40001.896ms  40001.474ms        event:kv/kvserver/batcheval/cmd_export.go:243 [n3,s3,r64/3:/{Table/104-Max}] the error we are getting is context canceled›
‹ 40002.107ms  40001.692ms    event:kv/kvserver/replica_evaluate.go:550 [n3,s3,r64/3:/{Table/104-Max}] evaluated Export command header:<kvnemesis_seq:<> key:"\360\211" end_key:"\360\212" > resume_key_ts:<> start_time:<> split_mid_key:true target_file_size:16777216 fingerprint_options:<> , txn=<nil> : resp=header:<> start_time:<> , err=context canceled›
‹ 40002.192ms      0.085ms    event:kv/kvserver/replica_read.go:217 [n3,s3,r64/3:/{Table/104-Max}] context canceled›
‹ 40002.241ms      0.049ms    event:kv/kvserver/replica_send.go:199 [n3,s3,r64/3:/{Table/104-Max}] replica.Send got error: context canceled›
‹ 40002.274ms      0.033ms    event:server/node.go:1204 [n3] error from stores.Send: context canceled›
‹ 40002.481ms      0.207ms    event:stack as of 21.3s ago: goroutine 4269 [sleep]:›
‹time.Sleep(0x9502f9000)›
‹       GOROOT/src/runtime/time.go:195 +0x135›
‹github.com/cockroachdb/cockroach/pkg/storage.mvccExportToWriter({0x74c48b0, 0xc004a615c0}, {0x7505f10, 0xc005618600}, {{{0xc00a086a08, 0x2, 0x8}, {0x0, 0x0, 0x0}}, ...}, ...)›
‹       github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:6519 +0x399›
‹github.com/cockroachdb/cockroach/pkg/storage.MVCCExportToSST({0x74c48b0?, 0xc004a61560?}, 0x754ea90?, {0x7505f10, 0xc005618600}, {{{0xc00a086a08, 0x2, 0x8}, {0x0, 0x0, ...}}, ...}, ...)›
‹       github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:6349 +0x2a5›
‹github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval.evalExport({_, _}, {_, _}, {{0x75632b0, 0xc003bf1500}, {{0x175c1064a34626bb, 0x0, 0x0}, 0x0, ...}, ...}, ...)›
‹       github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval/cmd_export.go:232 +0xb36›
‹github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateCommand({_, _}, {_, _}, {_, _}, _, _, {{0x175c1064a34626bb, 0x0, ...}, ...}, ...)›
‹       github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_evaluate.go:528 +0x4c7›
‹github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateBatch({_, _}, {_, _}, {_, _}, {_, _}, _, 0xc004acf7a0, ...)›
‹       github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_evaluate.go:294 +0xea5›
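
As an aside, the client-gives-up-before-the-server-notices behavior described above is just the standard Go context-timeout pattern. Here is a tiny, self-contained illustration (not CockroachDB code) of why the interesting part of the trace only exists on the server:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// The client arms a timeout and gives up as soon as the context expires.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	serverDone := make(chan error, 1)
	go func() {
		// Stand-in for remote evaluation that only notices the cancellation
		// after it has already been busy for a while.
		time.Sleep(200 * time.Millisecond)
		serverDone <- ctx.Err()
	}()

	<-ctx.Done()
	// The client only ever sees the deadline error; it has no idea what the
	// server was doing, which is why the trace has to be logged server-side.
	fmt.Println("client sees:", ctx.Err())

	// Much later the server-side work observes the same cancellation; in the
	// real system this is the point where the trace gets logged.
	fmt.Println("server sees:", <-serverDone)
}
```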

@arulajmani (Collaborator) left a comment


Thanks for doing this, this is sick! I have no opinions on the approach to be honest.

I want to leave the switch to verbose tracing spans to a follow-up that is done with some performance testing of different backups.

Consider putting it behind a cluster setting, defaulted to false. That way, we can reach for verbose tracing when running experiments or whatever else.
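
For reference, a registration along those lines might look like the sketch below, assuming the settings.RegisterBoolSetting signature from this release; the description string is illustrative, and the actual setting added by this PR is bulkio.backup.export_request_verbose_tracing:

```go
package backupccl // illustrative placement

import "github.com/cockroachdb/cockroach/pkg/settings"

// sendExportRequestWithVerboseTracing gates whether backup ExportRequests are
// sent with verbose tracing. It defaults to false so verbose tracing stays
// strictly opt-in for experiments and debugging sessions.
var sendExportRequestWithVerboseTracing = settings.RegisterBoolSetting(
	settings.TenantWritable,
	"bulkio.backup.export_request_verbose_tracing",
	"send each backup ExportRequest with a verbose tracing span", // illustrative description
	false,
)
```

Defaulting to false keeps the extra tracing cost out of normal backups and leaves verbose tracing as an explicit knob for debugging.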


@adityamaru adityamaru force-pushed the print-export-request-trace branch from 8ec9cf9 to 1538094 on May 5, 2023 21:11
@adityamaru adityamaru changed the title from "[WIP, DNM]: print batch request trace on server side context cancellation" to "backupccl,kvserver: log failed ExportRequest trace on client and server" on May 5, 2023
@adityamaru (Contributor, Author) commented:
I changed the error wrapping to a log line just because this error is stored in the jobs table and is surfaced on the console, so having a big trace dumped in the error doesn't look great.

The only part I'm antsy about is the change to node.go, so let's loop in a couple more folks who might have opinions.

@adityamaru adityamaru marked this pull request as ready for review May 5, 2023 21:13
@adityamaru adityamaru requested review from a team as code owners May 5, 2023 21:13
@adityamaru adityamaru requested review from a team, lidorcarmel, stevendanna and irfansharif and removed request for a team and lidorcarmel May 5, 2023 21:13
Review threads on pkg/ccl/backupccl/backup_processor.go (outdated), pkg/kv/sender.go (outdated), and pkg/server/node.go were resolved.
@adityamaru (Contributor, Author) commented:
#103034 will make the use of log.Error correct

@irfansharif (Contributor) left a comment


LGTM, but you could just inline what you're doing in SendWrappedWithAdmissionTraced like the comment below suggests.

@adityamaru adityamaru force-pushed the print-export-request-trace branch from 1538094 to 43ad7ef on May 11, 2023 20:42
@adityamaru adityamaru force-pushed the print-export-request-trace branch from 43ad7ef to d59a111 on May 15, 2023 16:14
@adityamaru (Contributor, Author) commented:
Flake is #99261

bors r=irfansharif

craig bot commented May 15, 2023

Build failed (retrying...):

craig bot commented May 15, 2023

Build succeeded:

@craig craig bot merged commit 7b86c7c into cockroachdb:master May 15, 2023
@@ -453,6 +471,9 @@ func runBackupProcessor(
// TimeoutError improves the opaque `context deadline exceeded` error
// message so use that instead.
if errors.HasType(exportRequestErr, (*contextutil.TimeoutError)(nil)) {
	if recording != nil {
		log.Errorf(ctx, "failed export request for span %s\n trace:\n%s", span.span, recording)
Review comment from a Contributor:
here and below: unnecessary space before trace.


sendExportRequestWithVerboseTracing = settings.RegisterBoolSetting(
	settings.TenantWritable,
	"bulkio.backup.export_request_verbose_tracing",
Review comment from a Contributor:
How come this name passed the test TestLintClusterSettingNames in pkg/sql/show_test.go? It's not valid per that linter. Ditto the _delay one above.

@adityamaru (Contributor, Author) commented:
blathers backport 23.1 22.2

blathers-crl bot commented Jun 1, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from d59a111 to blathers/backport-release-22.2-102793: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2 failed. See errors above.



@adityamaru adityamaru deleted the print-export-request-trace branch June 8, 2023 18:57
craig bot pushed a commit that referenced this pull request Nov 27, 2023
In #102793 we added server-side logging for batch requests
that encountered a context cancellation or timeout. This
was primarily motivated by the need to understand where export
requests sent during a backup were spending most of their time
and why they were timing out. This unintentionally added log spam
for other kinds of internal requests such as HeartbeatTxn, QueryTxn,
and EndTxn requests. This change limits the log line so that it is
only printed when the corresponding request is an export request.

Fixes: None
Release note: None
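
A minimal sketch of that narrowing, using stand-in types rather than the actual kvpb request types, might look like this:

```go
package sketch

import "context"

// request is a simplified stand-in for a KV request inside a BatchRequest;
// the real request types live in pkg/kv/kvpb and this Method() string is
// purely illustrative.
type request interface {
	Method() string // e.g. "Export", "HeartbeatTxn", "QueryTxn", "EndTxn"
}

// shouldLogCanceledTrace captures the follow-up change: the server-side trace
// is only logged when the canceled batch actually contains an export request,
// so routine internal requests no longer spam the logs.
func shouldLogCanceledTrace(ctx context.Context, reqs []request) bool {
	if ctx.Err() == nil {
		return false // only log when the context was canceled or timed out
	}
	for _, r := range reqs {
		if r.Method() == "Export" {
			return true
		}
	}
	return false
}
```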
craig bot pushed a commit that referenced this pull request Nov 27, 2023
115098: server: reduce log spam on canceled batch requests r=erikgrinaker a=adityamaru

In #102793 we added server-side logging for batch requests that encountered a context cancellation or timeout. This was primarily motivated by the need to understand where export requests sent during a backup were spending most of their time and why they were timing out. This unintentionally added log spam for other kinds of internal requests such as HeartbeatTxn, QueryTxn, and EndTxn requests. This change limits the log line so that it is only printed when the corresponding request is an export request.

Fixes: None
Epic: none
Release note: None

Co-authored-by: adityamaru <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Nov 27, 2023

Successfully merging this pull request may close these issues.

backupccl: print trace before timing out ExportRequest