roachtest: c2c/tpcc/warehouses=1000/duration=60/cutover=30 failed #106022

cockroach-teamcity · 2023-07-02T11:16:03Z

roachtest.c2c/tpcc/warehouses=1000/duration=60/cutover=30 failed with artifacts on master @ aacba20d325e5702836e9a76be646b5f1bd922af:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:886
	            				main/pkg/cmd/roachtest/monitor.go:105
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				GOROOT/src/runtime/asm_amd64.s:1594
	Error:      	Received unexpected error:
	            	read tcp 172.17.0.3:35274->104.196.129.31:26257: read: connection reset by peer
	Test:       	c2c/tpcc/warehouses=1000/duration=60/cutover=30
(require.go:1360).NoError: FailNow called
(soon.go:64).SucceedsWithin: condition failed to evaluate within 1h30m0s: from cluster_to_cluster.go:540: no replicated time
(test_runner.go:1122).func1: 4 dead node(s) detected
test artifacts and logs in: /artifacts/c2c/tpcc/warehouses=1000/duration=60/cutover=30/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-29347

cockroach-teamcity · 2023-07-04T10:05:29Z

roachtest.c2c/tpcc/warehouses=1000/duration=60/cutover=30 failed with artifacts on master @ 428dc9da6a320de218460de6c6c8807caa4ded98:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:886
	            				main/pkg/cmd/roachtest/monitor.go:105
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				GOROOT/src/runtime/asm_amd64.s:1594
	Error:      	Received unexpected error:
	            	read tcp 172.17.0.3:48054->35.227.3.73:26257: read: connection reset by peer
	Test:       	c2c/tpcc/warehouses=1000/duration=60/cutover=30
(require.go:1360).NoError: FailNow called
(test_runner.go:1122).func1: 1 dead node(s) detected
test artifacts and logs in: /artifacts/c2c/tpcc/warehouses=1000/duration=60/cutover=30/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

cockroach-teamcity · 2023-07-05T10:10:45Z

roachtest.c2c/tpcc/warehouses=1000/duration=60/cutover=30 failed with artifacts on master @ 34699bb9c1557fce449e08a68cd259efec94926f:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:886
	            				main/pkg/cmd/roachtest/monitor.go:105
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				GOROOT/src/runtime/asm_amd64.s:1594
	Error:      	Received unexpected error:
	            	read tcp 172.17.0.3:51780->34.73.121.23:26257: read: connection reset by peer
	Test:       	c2c/tpcc/warehouses=1000/duration=60/cutover=30
(require.go:1360).NoError: FailNow called
(test_runner.go:1122).func1: 1 dead node(s) detected
test artifacts and logs in: /artifacts/c2c/tpcc/warehouses=1000/duration=60/cutover=30/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

msbutler · 2023-07-06T12:47:53Z

This test has been failing every night due to dead node detection. Node 1 is consistently panicking because it detects a tracing span being used after Finish within the replica application decoder in kvserver. One next step could be to rerun this test with the environment variable COCKROACH_DEBUG_SPAN_USE_AFTER_FINISH set to true, but before that, I want to ping replication folks to see if they've seen this before.

See trace found in node 1's logs:

I230705 09:48:36.338110 377642 server/node.go:1318 ⋮ [T1,n1] 19976 +‹     0.000ms      0.000ms    === operation:/cockroach.roachpb.Internal/Batch›
I230705 09:48:36.338819 377643 server/node.go:1318 ⋮ [T1,n1] 19977  batch request ‹QueryTxn [/Tenant/2/Table/106/1/431/0,/Min)› failed with error: ‹context canceled›
I230705 09:48:36.338819 377643 server/node.go:1318 ⋮ [T1,n1] 19977 +trace:
I230705 09:48:36.338819 377643 server/node.go:1318 ⋮ [T1,n1] 19977 +‹     0.000ms      0.000ms    === operation:/cockroach.roachpb.Internal/Batch›
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978  a panic has occurred!
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +use of Span after Finish. Span: ‹dist sender send›. Finish previously called at: ‹<stack not captured. Set debugUseAfterFinish>›
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +(1) attached stack trace
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  -- stack trace:
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | runtime.gopanic
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   GOROOT/src/runtime/panic.go:884
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | [...repeated from below...]
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +Wraps: (2) assertion failure
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +Wraps: (3) attached stack trace
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  -- stack trace:
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).detectUseAfterFinish
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:182
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).Tracer
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:225
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaDecoder).createTracingSpans
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_decoder.go:226
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaDecoder).DecodeAndBind
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_decoder.go:63
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).Decode
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:142
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:901
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:751
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:660
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftSchedulerShard).worker
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:418
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:321
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  | runtime.goexit
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +  |   GOROOT/src/runtime/asm_amd64.s:1594
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +Wraps: (4) use of Span after Finish. Span: ‹dist sender send›. Finish previously called at: ‹<stack not captured. Set debugUseAfterFinish>›
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +Error types: (1) *withstack.withStack (2) *assert.withAssertionFailure (3) *withstack.withStack (4) *errutil.leafError
E230705 09:48:36.341495 334 1@util/log/logcrash/crash_reporting.go:188 ⋮ [T1,n1] 19978 +HINT: ‹You have encountered an unexpected error.›

This occurred during the initial scan of the test.

blathers-crl · 2023-07-06T13:00:38Z

cc @cockroachdb/replication

msbutler · 2023-07-06T13:03:17Z

adding @cockroachdb/replication because I'm wondering if something in your neck of the woods changed in the past week that could be related to this panic. See my read of the logs here.

erikgrinaker · 2023-07-06T13:32:21Z

Might be fallout from the recent reproposal refactor, cc @tbg.

erikgrinaker · 2023-07-07T11:54:58Z

This might also be related to the trace logging that DR added in #102793, which is immediately before the panic in the logs. More likely though, the caller finishes the span and we keep using it during application. I'm not familiar with the details here.

cockroach/pkg/server/node.go

Lines 1316 to 1321 in 66c9f93

    if pErr != nil && ctx.Err() != nil {
        if sp := tracing.SpanFromContext(ctx); sp != nil && !sp.IsNoop() {
            log.Infof(ctx, "batch request %s failed with error: %s\ntrace:\n%s", args.String(),
                pErr.GoError().Error(), sp.GetConfiguredRecording().String())
        }
    }
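For readers unfamiliar with this class of bug, here is a minimal Go sketch of use-after-finish detection. It is not CockroachDB's tracing code: `toySpan`, `Record`, and `captureStack` are hypothetical names, and the `captureStack` flag only loosely mirrors the "stack not captured. Set debugUseAfterFinish" hint in the panic message. It shows why the finish stack is missing unless capture is enabled before the span is finished.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"sync/atomic"
)

// toySpan is a hypothetical stand-in for a tracing span. It is NOT
// CockroachDB's tracing.Span; it only illustrates the detection idea.
type toySpan struct {
	op           string
	captureStack bool // assumption: analogous to a debug-only setting
	finished     atomic.Bool
	finishStack  string
	events       []string
}

// Finish marks the span as done. The stack of the Finish call is only
// recorded when captureStack is set, because capturing it is expensive.
func (s *toySpan) Finish() {
	if s.captureStack {
		s.finishStack = string(debug.Stack())
	}
	s.finished.Store(true)
}

// Record appends an event, or returns an error if the span was already
// finished (the "use after Finish" condition).
func (s *toySpan) Record(msg string) error {
	if s.finished.Load() {
		where := "<stack not captured>"
		if s.captureStack {
			where = s.finishStack
		}
		return fmt.Errorf("use of span after finish: span=%q, finish called at: %s", s.op, where)
	}
	s.events = append(s.events, msg)
	return nil
}

func main() {
	sp := &toySpan{op: "dist sender send", captureStack: true}
	fmt.Println("before finish:", sp.Record("proposal submitted"))

	// The caller gives up (for example, its context is canceled) and
	// finishes the span while another component still holds a reference.
	sp.Finish()

	if err := sp.Record("applying command"); err != nil {
		fmt.Println("detected:", err)
	}
}
```

With `captureStack` set to false the error still fires, but it cannot say who called `Finish`, which is the situation in the log above.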

cockroach-teamcity · 2023-07-08T10:16:03Z

roachtest.c2c/tpcc/warehouses=1000/duration=60/cutover=30 failed with artifacts on master @ 43c26aec0072f76e02e6d5ffc1b7079026b24630:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:913
	            				main/pkg/cmd/roachtest/monitor.go:105
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				GOROOT/src/runtime/asm_amd64.s:1594
	Error:      	Received unexpected error:
	            	read tcp 172.17.0.3:57742->35.231.191.38:26257: read: connection reset by peer
	Test:       	c2c/tpcc/warehouses=1000/duration=60/cutover=30
(require.go:1360).NoError: FailNow called
(sql_runner.go:82).Exec: error executing 'ALTER TENANT $1 COMPLETE REPLICATION TO SYSTEM TIME $2::string': pq: cutover time 1688809497.972791000,0 is before earliest safe cutover time 1688809830.746973004,0
(test_runner.go:1122).func1: 1 dead node(s) detected
test artifacts and logs in: /artifacts/c2c/tpcc/warehouses=1000/duration=60/cutover=30/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0
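To put the cutover error in this run into wall-clock terms, the two timestamps can be compared with a short Go sketch. It assumes the values are printed as `<seconds>.<nanoseconds>,<logical>`, as they appear in the error message; `parseWallTime` is a hypothetical helper, not a CockroachDB API.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseWallTime extracts the wall-clock part of a timestamp printed as
// "<seconds>.<nanoseconds>,<logical>". Hypothetical helper that assumes
// the format seen in the error message; the logical part is ignored.
func parseWallTime(s string) (time.Time, error) {
	wall, _, ok := strings.Cut(s, ",")
	if !ok {
		return time.Time{}, fmt.Errorf("missing logical component in %q", s)
	}
	secStr, nanoStr, ok := strings.Cut(wall, ".")
	if !ok || len(nanoStr) != 9 {
		return time.Time{}, fmt.Errorf("unexpected wall time format in %q", s)
	}
	sec, err := strconv.ParseInt(secStr, 10, 64)
	if err != nil {
		return time.Time{}, err
	}
	nanos, err := strconv.ParseInt(nanoStr, 10, 64)
	if err != nil {
		return time.Time{}, err
	}
	return time.Unix(sec, nanos).UTC(), nil
}

func main() {
	requested, err := parseWallTime("1688809497.972791000,0")
	if err != nil {
		panic(err)
	}
	earliest, err := parseWallTime("1688809830.746973004,0")
	if err != nil {
		panic(err)
	}
	fmt.Println("requested cutover:    ", requested.Format(time.RFC3339Nano))
	fmt.Println("earliest safe cutover:", earliest.Format(time.RFC3339Nano))
	fmt.Println("gap:", earliest.Sub(requested)) // prints "gap: 5m32.774182004s"
}
```

The requested cutover time is roughly five and a half minutes earlier than the earliest time the destination considered safe.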

cockroach-teamcity · 2023-07-10T10:11:01Z

roachtest.c2c/tpcc/warehouses=1000/duration=60/cutover=30 failed with artifacts on master @ 0207c613fa7c8f3ab66c4518ee1e52dabb863426:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:913
	            				main/pkg/cmd/roachtest/monitor.go:105
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				GOROOT/src/runtime/asm_amd64.s:1594
	Error:      	Received unexpected error:
	            	read tcp 172.17.0.3:33848->34.148.122.18:26257: read: connection reset by peer
	Test:       	c2c/tpcc/warehouses=1000/duration=60/cutover=30
(require.go:1360).NoError: FailNow called
(test_runner.go:1122).func1: 1 dead node(s) detected
test artifacts and logs in: /artifacts/c2c/tpcc/warehouses=1000/duration=60/cutover=30/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

msbutler · 2023-07-13T22:00:32Z

@erikgrinaker this test hasn't failed in 3 days. Are you aware of some patch in replication land that may have fixed this?

erikgrinaker · 2023-07-17T10:13:51Z

No. I won't have time to look into this, @pavelkalinnikov is on test flake duty for replication.

blathers-crl · 2023-07-17T10:14:05Z

cc @cockroachdb/replication

msbutler · 2023-07-17T14:13:33Z

Given that this consistently flaked for 4 days and now hasn't flaked for a week, I'm inclined to close this. @pavelkalinnikov I'll let you make the final call.

tbg · 2023-07-18T14:58:36Z

The closest I can think of is #105877 but that merged on June 30. The test last failed on July 10.

cockroach-teamcity added this to the 23.2 milestone Jul 2, 2023

msbutler self-assigned this Jul 5, 2023

msbutler added the T-kv-replication label Jul 6, 2023

exalate-issue-sync bot added sync-me and removed T-kv-replication labels Jul 11, 2023

adityamaru removed the release-blocker label Jul 11, 2023

erikgrinaker assigned pav-kv Jul 17, 2023

erikgrinaker added the T-kv-replication label Jul 17, 2023

andrewbaptist added the C-bug label Jul 19, 2023

tbg closed this as completed Jul 24, 2023

exalate-issue-sync bot unassigned msbutler Jul 24, 2023

exalate-issue-sync bot removed the T-kv-replication label Jul 24, 2023

github-project-automation bot added this to Disaster Recovery Backlog Aug 28, 2024

github-project-automation bot moved this to Done in Disaster Recovery Backlog Aug 28, 2024
