roachtest: ycsb/E/nodes=3/cpu=32 failed #135163

Closed
cockroach-teamcity opened this issue Nov 14, 2024 · 5 comments
Labels
branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Nov 14, 2024

roachtest.ycsb/E/nodes=3/cpu=32 failed with artifacts on release-24.2 @ 287e165b88ff5aa1aa3c9b9ff1303d859c1f960e:

(cluster.go:2394).Run: context canceled
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
unexpected node event: n3: cockroach process for system interface died (exit code 7)
test artifacts and logs in: /artifacts/ycsb/E/nodes=3/cpu=32/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=32
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/test-eng

This test on roachdash | Improve this report!

Jira issue: CRDB-44384

@cockroach-teamcity cockroach-teamcity added branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-testeng TestEng Team labels Nov 14, 2024
@herkolategan
Collaborator

Disk stall (node died)

E241114 13:42:14.931722 4885 jobs/metricspoller/poller.go:75 ⋮ [T1,Vsystem,n3,job=POLL JOBS STATS id=101] 663  Periodic stats collector task ‹manage-pts› completed with error txn exec: context canceled
E241114 13:42:14.931807 4885 jobs/registry.go:896 ⋮ [T1,Vsystem,n3] 664  error getting live session: could not update session 010180fd27552219a44f5d93eadc413475db88: aborted in DistSender: result is ambiguous: context canceled
F241114 13:42:14.921588 2897925 1@util/log/file.go:270 ⋮ [-] 608  disk stall detected: unable to sync log files within 20s
F241114 13:42:14.921588 2897925 1@util/log/file.go:270 ⋮ [-] 608 !goroutine 2897925 [running]:
F241114 13:42:14.921588 2897925 1@util/log/file.go:270 ⋮ [-] 608 !github.com/cockroachdb/cockroach/pkg/util/allstacks.GetWithBuf({0x0?, 0xc06b187978?, 0x6a9b45?})
F241114 13:42:14.921588 2897925 1@util/log/file.go:270 ⋮ [-] 608 !	github.com/cockroachdb/cockroach/pkg/util/allstacks/allstacks.go:27 +0x74
F241114 13:42:14.921588 2897925 1@util/log/file.go:270 ⋮ [-] 608 !github.com/cockroachdb/cockroach/pkg/util/allstacks.Get(...)
F241114 13:42:14.921588 2897925 1@util/log/file.go:270 ⋮ [-] 608 !	github.com/cockroachdb/cockroach/pkg/util/allstacks/allstacks.go:14
F241114 13:42:14.921588 2897925 1@util/log/file.go:270 ⋮ [-] 608 !github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(_, {{{0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, ...}}, ...})
F241114 13:42:14.921588 2897925 1@util/log/file.go:270 ⋮ [-] 608 !	github.com/cockroachdb/cockroach/pkg/util/log/clog.go:294 +0xc6
F241114 13:42:14.921588 2897925 1@util/log/file.go:270 ⋮ [-] 608 !github.com/cockroachdb/cockroach/pkg/util/log.logfDepthInternal({0x835d3b8, 0xd551840}, 0x2, 0x4, 0x1, 0x1?, {0x6b97bac, 0x37}, {0xc06b187fc0, 0x1, ...})
F241114 13:42:14.921588 2897925 1@util/log/file.go:270 ⋮ [-] 608 !	github.com/cockroachdb/cockroach/pkg/util/log/channels.go:104 +0x5c5
F241114 13:42:14.921588 2897925 1@util/log/file.go:270 ⋮ [-] 608 !github.com/cockroachdb/cockroach/pkg/util/log.shoutfDepth(...)
F241114 13:42:14.921588 2897925 1@util/log/file.go:270 ⋮ [-] 608 !	github.com/cockroachdb/cockroach/pkg/util/log/channels.go:111
F241114 13:42:14.921588 2897925 1@util/log/file.go:270 ⋮ [-] 608 !github.com/cockroachdb/cockroach/pkg/util/log.loggerOps.Shoutf(...)
F241114 13:42:14.921588 2897925 1@util/log/file.go:270 ⋮ [-] 608 !	github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/util/log/log_channels_generated.go:1489
F241114 13:42:14.921588 2897925 1@util/log/file.go:270 ⋮ [-] 608 !github.com/cockroachdb/cockroach/pkg/util/log.(*fileSink).flushAndMaybeSyncLocked.func1()
F241114 13:42:14.921588 2897925 1@util/log/file.go:270 ⋮ [-] 608 !	github.com/cockroachdb/cockroach/pkg/util/log/file.go:270 +0xb9
F241114 13:42:14.921588 2897925 1@util/log/file.go:270 ⋮ [-] 608 !created by time.goFunc
F241114 13:42:14.921588 2897925 1@util/log/file.go:270 ⋮ [-] 608 !	GOROOT/src/time/sleep.go:177 +0x2d
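
For readers unfamiliar with this kind of check: the stack above shows the fatal message coming from a function started by `time.goFunc`, which is how Go runs `time.AfterFunc` callbacks. That suggests a timer armed around the log sync. The following is only a minimal sketch of that general pattern, not CockroachDB's implementation. The names `withSyncWatchdog` and `simulate` are hypothetical, the thresholds are scaled down from the 10s and 20s seen in the logs, and the callbacks only set flags, whereas the real stall path logs a fatal error.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// withSyncWatchdog runs syncFn with two timers armed around it. If syncFn is
// still blocked when a timer expires, the matching callback runs on its own
// goroutine (this is how time.AfterFunc behaves). All names here are
// hypothetical and are not CockroachDB's API.
func withSyncWatchdog(
	slowAfter, stallAfter time.Duration,
	onSlow, onStall func(),
	syncFn func() error,
) error {
	slow := time.AfterFunc(slowAfter, onSlow)
	stall := time.AfterFunc(stallAfter, onStall)
	// Stop both timers once the sync returns so a fast sync reports nothing.
	defer slow.Stop()
	defer stall.Stop()
	return syncFn()
}

// simulate runs a fake sync that blocks for syncMs milliseconds and reports
// which of the two thresholds fired while it was blocked.
func simulate(syncMs, slowMs, stallMs int) (slowFired, stallFired bool) {
	var slow, stall int32
	_ = withSyncWatchdog(
		time.Duration(slowMs)*time.Millisecond,
		time.Duration(stallMs)*time.Millisecond,
		func() { atomic.StoreInt32(&slow, 1) },
		func() { atomic.StoreInt32(&stall, 1) },
		func() error {
			time.Sleep(time.Duration(syncMs) * time.Millisecond)
			return nil
		},
	)
	return atomic.LoadInt32(&slow) == 1, atomic.LoadInt32(&stall) == 1
}

func main() {
	// Slow threshold 100ms, stall threshold 200ms (stand-ins for 10s and 20s).
	for _, syncMs := range []int{10, 150, 300} {
		slow, stall := simulate(syncMs, 100, 200)
		fmt.Printf("sync took ~%dms: slow=%v stall=%v\n", syncMs, slow, stall)
	}
}
```

In this sketch a sync that finishes before the first threshold triggers neither callback, one that finishes between the two triggers only the slowness callback, and one that outlasts both triggers the stall callback as well.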

@herkolategan
Collaborator

@cockroachdb/kv Not entirely sure what went wrong here. Could someone help take a look?

@herkolategan herkolategan added the T-kv KV Team label Nov 19, 2024
@exalate-issue-sync exalate-issue-sync bot removed the T-testeng TestEng Team label Nov 19, 2024
@iskettaneh
Contributor

iskettaneh commented Nov 19, 2024

I see that for node 3 at 13:41:51.452358 we have this log: 2@gossip/gossip.go:1406 ⋮ [T1,Vsystem,n3] 583 first range unavailable; trying remaining addresses.
And then these two log lines:

W241114 13:41:51.579738 577 raft/raft.go:1223 ⋮ [T1,Vsystem,n3,s3,r76/2:‹/Table/106/1/"user{461…-578…}›] 584  2 stepped down to follower since quorum is not active
W241114 13:42:14.921525 2897923 1@util/log/file.go:279 ⋮ [-] 603  disk slowness detected: unable to sync log files within 10s

There are a bunch of error log lines like:

W241114 13:42:14.929378 646 kv/kvserver/liveness/liveness.go:667 ⋮ [T1,Vsystem,n3,liveness-hb] 624  failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 24.349s (given timeout 3s): result is ambiguous: context done during DistSender.Send: ba: ‹ConditionalPut [/System/NodeLiveness/3], EndTxn(commit modified-span (node-liveness)) [/System/NodeLiveness/3], [txn: 86311fb4], [can-forward-ts]› RPC error: grpc: ‹context deadline exceeded› [code 4/DeadlineExceeded]
W241114 13:42:14.929892 2897905 kv/kvclient/kvcoord/dist_sender.go:2761 ⋮ [T1,Vsystem,n3,range-lookup=/System/NodeLiveness/3] 628  slow replica RPC: have been waiting 15.16s (0 attempts) for RPC Scan [/Meta2/System/NodeLiveness/3/NULL,/System‹/›‹"›‹"›), [max_span_request_keys: 9], [target_bytes: 0] to replica (n3,s3):2; resp: ‹(err: <nil>), *kvpb.ScanResponse›

10 seconds later, the node crashes with:

F241114 13:42:14.921588 2897925 1@util/log/file.go:270 ⋮ [-] 608  disk stall detected: unable to sync log files within 20s
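
To double-check the timeline, the gaps between entries can be computed from the timestamps in the quoted lines. The following is a rough sketch that assumes every line starts with a one-letter severity followed by `YYMMDD HH:MM:SS.micros`, as in the lines above. The helper names `parseLogTime` and `gapMicros` are made up for illustration and are not part of any CockroachDB tooling.

```go
package main

import (
	"fmt"
	"time"
)

// prefixLayout matches the "YYMMDD HH:MM:SS.micros" portion that follows the
// one-letter severity in the log lines quoted in this thread.
const prefixLayout = "060102 15:04:05.000000"

// parseLogTime extracts the timestamp from the start of a log line.
// Hypothetical helper; it assumes the prefix format described above.
func parseLogTime(line string) (time.Time, error) {
	const n = 1 + len(prefixLayout) // severity letter + timestamp
	if len(line) < n {
		return time.Time{}, fmt.Errorf("line too short: %q", line)
	}
	return time.Parse(prefixLayout, line[1:n])
}

// gapMicros returns the time between two log lines in microseconds.
func gapMicros(earlier, later string) (int64, error) {
	a, err := parseLogTime(earlier)
	if err != nil {
		return 0, err
	}
	b, err := parseLogTime(later)
	if err != nil {
		return 0, err
	}
	return b.Sub(a).Microseconds(), nil
}

func main() {
	// Prefixes copied from the lines quoted in this comment.
	lines := []string{
		"W241114 13:41:51.579738 577 raft/raft.go:1223",
		"W241114 13:42:14.921525 2897923 1@util/log/file.go:279",
		"F241114 13:42:14.921588 2897925 1@util/log/file.go:270",
	}
	for i := 1; i < len(lines); i++ {
		gap, err := gapMicros(lines[i-1], lines[i])
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("entry %d -> entry %d: %d microseconds\n", i, i+1, gap)
	}
}
```

The timestamps only say when each entry was recorded, so this helps order the events but does not by itself show when the underlying sync began.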

Is there a way to check whether there was actually a problem with the disk in GCE? I took a quick look at the metrics, but I wasn't sure which one we usually use for this type of problem. Also, since node 3 crashed, some important metrics might not have been collected.

@arulajmani
Collaborator

@iskettaneh given this is a disk stall, we typically close it out as an infra flake (assuming it isn't a test-induced disk stall, which isn't the case for these YCSB tests).

@arulajmani arulajmani added X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Nov 19, 2024
@srosenberg
Member

Tracking issue for reference: #97968


5 participants