Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: c2c/tpcc/warehouses=1000/duration=60/cutover=30 failed #106022

Closed
cockroach-teamcity opened this issue Jul 2, 2023 · 14 comments
Closed
Assignees
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. sync-me T-disaster-recovery
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jul 2, 2023

roachtest.c2c/tpcc/warehouses=1000/duration=60/cutover=30 failed with artifacts on master @ aacba20d325e5702836e9a76be646b5f1bd922af:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:886
	            				main/pkg/cmd/roachtest/monitor.go:105
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				GOROOT/src/runtime/asm_amd64.s:1594
	Error:      	Received unexpected error:
	            	read tcp 172.17.0.3:35274->104.196.129.31:26257: read: connection reset by peer
	Test:       	c2c/tpcc/warehouses=1000/duration=60/cutover=30
(require.go:1360).NoError: FailNow called
(soon.go:64).SucceedsWithin: condition failed to evaluate within 1h30m0s: from cluster_to_cluster.go:540: no replicated time
(test_runner.go:1122).func1: 4 dead node(s) detected
test artifacts and logs in: /artifacts/c2c/tpcc/warehouses=1000/duration=60/cutover=30/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-29347

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery labels Jul 2, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Jul 2, 2023
@cockroach-teamcity
Copy link
Member Author

roachtest.c2c/tpcc/warehouses=1000/duration=60/cutover=30 failed with artifacts on master @ 428dc9da6a320de218460de6c6c8807caa4ded98:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:886
	            				main/pkg/cmd/roachtest/monitor.go:105
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				GOROOT/src/runtime/asm_amd64.s:1594
	Error:      	Received unexpected error:
	            	read tcp 172.17.0.3:48054->35.227.3.73:26257: read: connection reset by peer
	Test:       	c2c/tpcc/warehouses=1000/duration=60/cutover=30
(require.go:1360).NoError: FailNow called
(test_runner.go:1122).func1: 1 dead node(s) detected
test artifacts and logs in: /artifacts/c2c/tpcc/warehouses=1000/duration=60/cutover=30/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.c2c/tpcc/warehouses=1000/duration=60/cutover=30 failed with artifacts on master @ 34699bb9c1557fce449e08a68cd259efec94926f:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:886
	            				main/pkg/cmd/roachtest/monitor.go:105
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				GOROOT/src/runtime/asm_amd64.s:1594
	Error:      	Received unexpected error:
	            	read tcp 172.17.0.3:51780->34.73.121.23:26257: read: connection reset by peer
	Test:       	c2c/tpcc/warehouses=1000/duration=60/cutover=30
(require.go:1360).NoError: FailNow called
(test_runner.go:1122).func1: 1 dead node(s) detected
test artifacts and logs in: /artifacts/c2c/tpcc/warehouses=1000/duration=60/cutover=30/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@msbutler msbutler self-assigned this Jul 5, 2023
@msbutler
Copy link
Collaborator

msbutler commented Jul 6, 2023

This test has been failing every night due to dead node detection. Node 1 is consistently panicing because it has detected tracing span use after finish within the replica application decoder in kvserver. One next step could be to rerun this test with the env variable COCKROACH_DEBUG_SPAN_USE_AFTER_FINISH set to true, but before that, I want to ping replication folks to see if they've seen this before.

See trace found in node 1's logs:
I230705 09:48:36.338110 377642 server/node.go:1318 ⋮ [T1,n1] 19976 +‹     0.000ms      0.000ms    === operation:/cockroach.roachpb.Internal/Batch›
I230705 09:48:36.338819 377643 server/node.go:1318 ⋮ [T1,n1] 19977  batch request ‹QueryTxn [/Tenant/2/Table/106/1/431/0,/Min)› failed with error: ‹context canceled›
I230705 09:48:36.338819 377643 server/node.go:1318 ⋮ [T1,n1] 19977 +trace:
I230705 09:48:36.338819 377643 server/node.go:1318 ⋮ [T1,n1] 19977 +‹     0.000ms      0.000ms    === operation:/cockroach.roachpb.Internal/Batch›
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978  a panic has occurred!
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +use of Span after Finish. Span: ‹dist sender send›. Finish previously called at: ‹<stack not captured. Set debugUseAfterFinish>›
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +(1) attached stack trace
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  -- stack trace:
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | runtime.gopanic
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   GOROOT/src/runtime/panic.go:884
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | [...repeated from below...]
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +Wraps: (2) assertion failure
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +Wraps: (3) attached stack trace
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  -- stack trace:
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).detectUseAfterFinish
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:182
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).Tracer
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:225
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaDecoder).createTracingSpans
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_decoder.go:226
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaDecoder).DecodeAndBind
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_decoder.go:63
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).Decode
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:142
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:901
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:751
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:660
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftSchedulerShard).worker
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:418
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:321
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | runtime.goexit
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   GOROOT/src/runtime/asm_amd64.s:1594
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +Wraps: (4) use of Span after Finish. Span: ‹dist sender send›. Finish previously called at: ‹<stack not captured. Set debugUseAfterFinish>›
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +Error types: (1) *withstack.withStack (2) *assert.withAssertionFailure (3) *withstack.withStack (4) *errutil.leafError
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +HINT: ‹You have encountered an unexpected error.›

This occured during the initial scan of the test.

@blathers-crl
Copy link

blathers-crl bot commented Jul 6, 2023

cc @cockroachdb/replication

@msbutler
Copy link
Collaborator

msbutler commented Jul 6, 2023

adding @cockroachdb/replication because I'm wondering if something in your neck of the woods changed in the past week that could be related to this panic. See my read of the logs here.

@erikgrinaker
Copy link
Contributor

Might be fallout from the recent reproposal refactor, cc @tbg.

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Jul 7, 2023

This might also be related to the trace logging that DR added in #102793, which is immediately before the panic in the logs. More likely though, the caller finishes the span and we keep using it during application. I'm not familiar with the details here.

cockroach/pkg/server/node.go

Lines 1316 to 1321 in 66c9f93

if pErr != nil && ctx.Err() != nil {
if sp := tracing.SpanFromContext(ctx); sp != nil && !sp.IsNoop() {
log.Infof(ctx, "batch request %s failed with error: %s\ntrace:\n%s", args.String(),
pErr.GoError().Error(), sp.GetConfiguredRecording().String())
}
}

@cockroach-teamcity
Copy link
Member Author

roachtest.c2c/tpcc/warehouses=1000/duration=60/cutover=30 failed with artifacts on master @ 43c26aec0072f76e02e6d5ffc1b7079026b24630:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:913
	            				main/pkg/cmd/roachtest/monitor.go:105
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				GOROOT/src/runtime/asm_amd64.s:1594
	Error:      	Received unexpected error:
	            	read tcp 172.17.0.3:57742->35.231.191.38:26257: read: connection reset by peer
	Test:       	c2c/tpcc/warehouses=1000/duration=60/cutover=30
(require.go:1360).NoError: FailNow called
(sql_runner.go:82).Exec: error executing 'ALTER TENANT $1 COMPLETE REPLICATION TO SYSTEM TIME $2::string': pq: cutover time 1688809497.972791000,0 is before earliest safe cutover time 1688809830.746973004,0
(test_runner.go:1122).func1: 1 dead node(s) detected
test artifacts and logs in: /artifacts/c2c/tpcc/warehouses=1000/duration=60/cutover=30/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.c2c/tpcc/warehouses=1000/duration=60/cutover=30 failed with artifacts on master @ 0207c613fa7c8f3ab66c4518ee1e52dabb863426:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:913
	            				main/pkg/cmd/roachtest/monitor.go:105
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				GOROOT/src/runtime/asm_amd64.s:1594
	Error:      	Received unexpected error:
	            	read tcp 172.17.0.3:33848->34.148.122.18:26257: read: connection reset by peer
	Test:       	c2c/tpcc/warehouses=1000/duration=60/cutover=30
(require.go:1360).NoError: FailNow called
(test_runner.go:1122).func1: 1 dead node(s) detected
test artifacts and logs in: /artifacts/c2c/tpcc/warehouses=1000/duration=60/cutover=30/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@adityamaru adityamaru removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jul 11, 2023
@msbutler
Copy link
Collaborator

@erikgrinaker this test hasn't failed in 3 days. Are you aware of some patch in replication land that may have fixed this?

@erikgrinaker
Copy link
Contributor

No. I won't have time to look into this, @pavelkalinnikov is on test flake duty for replication.

@blathers-crl
Copy link

blathers-crl bot commented Jul 17, 2023

cc @cockroachdb/replication

@msbutler
Copy link
Collaborator

msbutler commented Jul 17, 2023

Given that this consistently flaked for 4 days and now hasn't flaked for a week, I'm inclined to close this. @pavelkalinnikov I'll let you make the final call.

@tbg
Copy link
Member

tbg commented Jul 18, 2023

The closest I can think of is #105877 but that merged on June 30. The test last failed on July 10.

@andrewbaptist andrewbaptist added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jul 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. sync-me T-disaster-recovery
Projects
No open projects
Archived in project
Development

No branches or pull requests

7 participants