
backupccl,kvserver: log failed ExportRequest trace on client and server #102793

Merged: 1 commit merged into cockroachdb:master from print-export-request-trace on May 15, 2023

Conversation

@adityamaru (Contributor) commented May 5, 2023

This change strives to improve observability around
backups that fail because of timed-out ExportRequests.
Currently, there is very little indication of what the request
was doing when the client cancelled the context after
the predetermined timeout window. With this change we
now log the trace of the ExportRequest that failed. If
the ExportRequest was served locally, the trace will be
part of the sender's tracing span. However, if the request
was served on a remote node, the sender does not wait
for the server-side evaluation to notice the context cancellation.
To work around this, we also print the trace on the server side
if the request encountered a context cancellation and the
associated tracing span is not a noop.

This change also adds a private cluster setting,
`bulkio.backup.export_request_verbose_tracing`, which, if set
to true, sends all backup ExportRequests with verbose tracing
enabled.

To debug a backup failing with a timed-out ExportRequest we
can now:

  • SET CLUSTER SETTING bulkio.backup.export_request_verbose_tracing = true;
  • SET CLUSTER SETTING trace.snapshot.rate = '1m';

Once the backup times out, we can look at the logs
for the server-side and client-side ExportRequest traces
to understand where execution was stuck for so long.
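
As an illustration, here is a minimal sketch of the server-side check described above. The type and method names are stand-ins, not the exact code in this change:

```go
package sketch

import (
	"context"
	"errors"
	"fmt"
)

// traceSpan is a simplified stand-in for a tracing span; the real type lives
// in pkg/util/tracing and its method names may differ.
type traceSpan interface {
	IsNoop() bool            // true if the span records nothing
	RecordingString() string // the collected trace rendered as text
}

// maybeLogCanceledTrace mirrors the idea above: if a request failed because
// its context was cancelled and it carried a real (non-noop) tracing span,
// dump the span's recording to the server log so the evaluation history of
// the timed-out ExportRequest is not lost.
func maybeLogCanceledTrace(ctx context.Context, sp traceSpan, evalErr error) {
	if evalErr == nil || !errors.Is(evalErr, context.Canceled) {
		return // only interested in context cancellations
	}
	if sp == nil || sp.IsNoop() {
		return // nothing was recorded, so there is nothing worth logging
	}
	// The real change logs through CockroachDB's logging package; fmt is used
	// here only to keep the sketch self-contained.
	fmt.Printf("request failed with %v\ntrace:\n%s\n", evalErr, sp.RecordingString())
}
```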

Fixes: #86047
Release note: None

blathers-crl bot commented May 5, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?



@adityamaru (Contributor, Author) commented May 5, 2023

@arulajmani I'm not sure how much you like this approach, but as discussed offline, for remotely served ExportRequests DistSender does not wait for the batch response before returning an error to the client. For this reason, the trace emitted on the client will only contain calls up to DistSender.Send. To work around this I've added a server-side log line in case the BatchRequest encounters a context cancellation. This way, if an ExportRequest were to time out, we'd at least have the trace for that request in the logs to understand why it timed out.

Let me know if you think it's useful to push forward with this diff. Some experimentation in roachprod shows that if we always send ExportRequests as verbose, then for requests that take longer than the timeout we see the following log line, which is definitely more than what we have today. A structured trace would give us less than what is pasted below, but it would be a good starting point. I want to leave the switch to verbose tracing spans to a follow-up that is done with some performance testing of different backups.

adityamaru-test-0003> W230504 22:13:50.766535 4269 server/node.go:1227 ⋮ [T1,n3] 237 request: ‹1 Export›
error: ‹context canceled›
trace: ‹     0.000ms      0.000ms    === operation:/cockroach.roachpb.Internal/Batch _unfinished:1 _verbose:1 node:3 span.kind:server request:Export [/Table/104/1,/Table/104/2), [wait-policy: Error]›
‹     0.000ms      0.000ms    [evalExport: {count: 1, duration 40s}]›
‹     0.000ms      0.000ms    [storage.MVCCExportToSST: {count: 1, duration 40s}]›
‹     0.039ms      0.039ms    event:server/node.go:1171 [n3] node received request: 1 Export›
‹     0.122ms      0.083ms    event:kv/kvserver/store_send.go:154 [n3,s3] executing Export [/Table/104/1,/Table/104/2), [wait-policy: Error]›
‹     0.159ms      0.037ms    event:kv/kvserver/replica_send.go:180 [n3,s3,r64/3:/{Table/104-Max}] read-only path›
‹     0.184ms      0.025ms    event:kv/kvserver/concurrency/concurrency_manager.go:195 [n3,s3,r64/3:/{Table/104-Max}] sequencing request›
‹     0.198ms      0.014ms    event:kv/kvserver/concurrency/concurrency_manager.go:276 [n3,s3,r64/3:/{Table/104-Max}] acquiring latches›
‹     0.217ms      0.020ms    event:kv/kvserver/concurrency/concurrency_manager.go:320 [n3,s3,r64/3:/{Table/104-Max}] scanning lock table for conflicting locks›
‹     0.257ms      0.039ms    event:kv/kvserver/replica_follower_read.go:126 [n3,s3,r64/3:/{Table/104-Max}] serving via follower read; query timestamp below closed timestamp by 1.71932036s›
‹     0.333ms      0.076ms    event:kv/kvserver/replica_read.go:297 [n3,s3,r64/3:/{Table/104-Max}] can drop latches early for batch (Export [/Table/104/1,/Table/104/2), [wait-policy: Error]); scanning lock table first to detect conflicts›
‹     0.399ms      0.066ms    event:kv/kvserver/replica_read.go:114 [n3,s3,r64/3:/{Table/104-Max}] lock table scan complete without conflicts; dropping latches early›
‹     0.415ms      0.016ms    event:kv/kvserver/replica_read.go:409 [n3,s3,r64/3:/{Table/104-Max}] executing read-only batch›
‹     0.422ms      0.007ms        === operation:evalExport _verbose:1 node:3 store:3 range:64/3:/{Table/104-Max}›
‹     0.422ms      0.000ms        [storage.MVCCExportToSST: {count: 1, duration 40s}]›
‹     0.427ms      0.005ms            === operation:storage.MVCCExportToSST _verbose:1 node:3 store:3 range:64/3:/{Table/104-Max}›
‹ 40001.896ms  40001.474ms        event:kv/kvserver/batcheval/cmd_export.go:243 [n3,s3,r64/3:/{Table/104-Max}] the error we are getting is context canceled›
‹ 40002.107ms  40001.692ms    event:kv/kvserver/replica_evaluate.go:550 [n3,s3,r64/3:/{Table/104-Max}] evaluated Export command header:<kvnemesis_seq:<> key:"\360\211" end_key:"\360\212" > resume_key_ts:<> start_time:<> split_mid_key:true target_file_size:16777216 fingerprint_options:<> , txn=<nil> : resp=header:<> start_time:<> , err=context canceled›
‹ 40002.192ms      0.085ms    event:kv/kvserver/replica_read.go:217 [n3,s3,r64/3:/{Table/104-Max}] context canceled›
‹ 40002.241ms      0.049ms    event:kv/kvserver/replica_send.go:199 [n3,s3,r64/3:/{Table/104-Max}] replica.Send got error: context canceled›
‹ 40002.274ms      0.033ms    event:server/node.go:1204 [n3] error from stores.Send: context canceled›
‹ 40002.481ms      0.207ms    event:stack as of 21.3s ago: goroutine 4269 [sleep]:›
‹time.Sleep(0x9502f9000)›
‹       GOROOT/src/runtime/time.go:195 +0x135›
‹github.com/cockroachdb/cockroach/pkg/storage.mvccExportToWriter({0x74c48b0, 0xc004a615c0}, {0x7505f10, 0xc005618600}, {{{0xc00a086a08, 0x2, 0x8}, {0x0, 0x0, 0x0}}, ...}, ...)›
‹       github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:6519 +0x399›
‹github.com/cockroachdb/cockroach/pkg/storage.MVCCExportToSST({0x74c48b0?, 0xc004a61560?}, 0x754ea90?, {0x7505f10, 0xc005618600}, {{{0xc00a086a08, 0x2, 0x8}, {0x0, 0x0, ...}}, ...}, ...)›
‹       github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:6349 +0x2a5›
‹github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval.evalExport({_, _}, {_, _}, {{0x75632b0, 0xc003bf1500}, {{0x175c1064a34626bb, 0x0, 0x0}, 0x0, ...}, ...}, ...)›
‹       github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval/cmd_export.go:232 +0xb36›
‹github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateCommand({_, _}, {_, _}, {_, _}, _, _, {{0x175c1064a34626bb, 0x0, ...}, ...}, ...)›
‹       github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_evaluate.go:528 +0x4c7›
‹github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateBatch({_, _}, {_, _}, {_, _}, {_, _}, _, 0xc004acf7a0, ...)›
‹       github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_evaluate.go:294 +0xea5›
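
As an aside, the client-gives-up-before-the-server-notices behavior described above is just the standard Go context-timeout pattern. Here is a tiny, self-contained illustration (not CockroachDB code) of why the interesting part of the trace only exists on the server:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// The client arms a timeout and gives up as soon as the context expires.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	serverDone := make(chan error, 1)
	go func() {
		// Stand-in for remote evaluation that only notices the cancellation
		// after it has already been busy for a while.
		time.Sleep(200 * time.Millisecond)
		serverDone <- ctx.Err()
	}()

	<-ctx.Done()
	// The client only ever sees the deadline error; it has no idea what the
	// server was doing, which is why the trace has to be logged server-side.
	fmt.Println("client sees:", ctx.Err())

	// Much later the server-side work observes the same cancellation; in the
	// real system this is the point where the trace gets logged.
	fmt.Println("server sees:", <-serverDone)
}
```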

@arulajmani (Collaborator) left a comment


Thanks for doing this, this is sick! I have no opinions on the approach to be honest.

I want to leave the switch to verbose tracing spans to a follow-up that is done with some performance testing of different backups.

Consider putting it behind a cluster setting, defaulted to false. That way, we can reach for verbose tracing when running experiments or whatever else.
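
For reference, a registration along those lines might look like the sketch below, assuming the settings.RegisterBoolSetting signature from this release; the description string is illustrative, and the actual setting added by this PR is bulkio.backup.export_request_verbose_tracing:

```go
package backupccl // illustrative placement

import "github.com/cockroachdb/cockroach/pkg/settings"

// sendExportRequestWithVerboseTracing gates whether backup ExportRequests are
// sent with verbose tracing. It defaults to false so verbose tracing stays
// strictly opt-in for experiments and debugging sessions.
var sendExportRequestWithVerboseTracing = settings.RegisterBoolSetting(
	settings.TenantWritable,
	"bulkio.backup.export_request_verbose_tracing",
	"send each backup ExportRequest with a verbose tracing span", // illustrative description
	false,
)
```

Defaulting to false keeps the extra tracing cost out of normal backups and leaves verbose tracing as an explicit knob for debugging.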


@adityamaru adityamaru force-pushed the print-export-request-trace branch from 8ec9cf9 to 1538094 on May 5, 2023 21:11
@adityamaru adityamaru changed the title from "[WIP, DNM]: print batch request trace on server side context cancellation" to "backupccl,kvserver: log failed ExportRequest trace on client and server" on May 5, 2023
@adityamaru (Contributor, Author) commented:
I changed the error wrapping to a log line just because this error is stored in the jobs table and is surfaced on the console, so having a big trace dumped in the error doesn't look great.

The only part I'm antsy about is the change to node.go, so let's loop in a couple more folks who might have opinions.

@adityamaru adityamaru marked this pull request as ready for review May 5, 2023 21:13
@adityamaru adityamaru requested review from a team as code owners May 5, 2023 21:13
@adityamaru adityamaru requested review from a team, lidorcarmel, stevendanna and irfansharif and removed request for a team and lidorcarmel May 5, 2023 21:13
Review threads on pkg/ccl/backupccl/backup_processor.go (outdated), pkg/kv/sender.go (outdated), and pkg/server/node.go were resolved.
@adityamaru (Contributor, Author) commented:
#103034 will make the use of log.Error correct

@irfansharif (Contributor) left a comment


LGTM, but you could just inline what you're doing in SendWrappedWithAdmissionTraced like the comment below suggests.

@adityamaru adityamaru force-pushed the print-export-request-trace branch from 1538094 to 43ad7ef on May 11, 2023 20:42
@adityamaru adityamaru force-pushed the print-export-request-trace branch from 43ad7ef to d59a111 on May 15, 2023 16:14
@adityamaru (Contributor, Author) commented:
Flake is #99261

bors r=irfansharif

craig bot commented May 15, 2023

Build failed (retrying...):

craig bot commented May 15, 2023

Build succeeded:

@craig craig bot merged commit 7b86c7c into cockroachdb:master May 15, 2023
@@ -453,6 +471,9 @@ func runBackupProcessor(
// TimeoutError improves the opaque `context deadline exceeded` error
// message so use that instead.
if errors.HasType(exportRequestErr, (*contextutil.TimeoutError)(nil)) {
	if recording != nil {
		log.Errorf(ctx, "failed export request for span %s\n trace:\n%s", span.span, recording)
Review comment from a Contributor:
here and below: unnecessary space before trace.


sendExportRequestWithVerboseTracing = settings.RegisterBoolSetting(
	settings.TenantWritable,
	"bulkio.backup.export_request_verbose_tracing",
Review comment from a Contributor:
How come this name passed the test TestLintClusterSettingNames in pkg/sql/show_test.go? It's not valid per that linter. Ditto the _delay one above.

@adityamaru (Contributor, Author) commented:
blathers backport 23.1 22.2

blathers-crl bot commented Jun 1, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from d59a111 to blathers/backport-release-22.2-102793: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2 failed. See errors above.



@adityamaru adityamaru deleted the print-export-request-trace branch June 8, 2023 18:57
craig bot pushed a commit that referenced this pull request Nov 27, 2023
In #102793 we added server-side logging for batch requests
that encountered a context cancellation or timeout. This
was primarily motivated by the need to understand where export
requests sent during a backup were spending most of their time
and why they were timing out. This unintentionally added log spam
for other kinds of internal requests such as HeartbeatTxn, QueryTxn,
and EndTxn requests. This change limits the log line so that it is
only printed when the corresponding request is an export request.

Fixes: None
Release note: None
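
A minimal sketch of that narrowing, using stand-in types rather than the actual kvpb request types, might look like this:

```go
package sketch

import "context"

// request is a simplified stand-in for a KV request inside a BatchRequest;
// the real request types live in pkg/kv/kvpb and this Method() string is
// purely illustrative.
type request interface {
	Method() string // e.g. "Export", "HeartbeatTxn", "QueryTxn", "EndTxn"
}

// shouldLogCanceledTrace captures the follow-up change: the server-side trace
// is only logged when the canceled batch actually contains an export request,
// so routine internal requests no longer spam the logs.
func shouldLogCanceledTrace(ctx context.Context, reqs []request) bool {
	if ctx.Err() == nil {
		return false // only log when the context was canceled or timed out
	}
	for _, r := range reqs {
		if r.Method() == "Export" {
			return true
		}
	}
	return false
}
```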
craig bot pushed a commit that referenced this pull request Nov 27, 2023
115098: server: reduce log spam on canceled batch requests r=erikgrinaker a=adityamaru

In #102793 we added server-side logging for batch requests that encountered a context cancellation or timeout. This was primarily motivated by the need to understand where export requests sent during a backup were spending most of their time and why they were timing out. This unintentionally added log spam for other kinds of internal requests such as HeartbeatTxn, QueryTxn, and EndTxn requests. This change limits the log line so that it is only printed when the corresponding request is an export request.

Fixes: None
Epic: none
Release note: None

Co-authored-by: adityamaru <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Nov 27, 2023

Successfully merging this pull request may close these issues.

backupccl: print trace before timing out ExportRequest