
roachtest: c2c/BulkOps/full failed #114706

Closed
cockroach-teamcity opened this issue Nov 19, 2023 · 11 comments
Labels: C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months T-disaster-recovery
Milestone: 23.2

@cockroach-teamcity (Member) commented Nov 19, 2023

roachtest.c2c/BulkOps/full failed with artifacts on release-23.2 @ c9f6f30496fe34af8d17410e5307f359458aaf52:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:749
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:978
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1268
	            				main/pkg/cmd/roachtest/monitor.go:119
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				src/runtime/asm_amd64.s:1598
	Error:      	Received unexpected error:
	            	expected job status succeeded, but got running
	            	(1) attached stack trace
	            	  -- stack trace:
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream.func1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:745
	            	  | github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration
	            	  | 	github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:213
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:727
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).main
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:978
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClusterToCluster.func1.1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1268
	            	  | main.(*monitorImpl).Go.func1
	            	  | 	main/pkg/cmd/roachtest/monitor.go:119
	            	  | golang.org/x/sync/errgroup.(*Group).Go.func1
	            	  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            	  | runtime.goexit
	            	  | 	src/runtime/asm_amd64.s:1598
	            	Wraps: (2) expected job status succeeded, but got running
	            	Error types: (1) *withstack.withStack (2) *errutil.leafError
	Test:       	c2c/BulkOps/full
(require.go:1360).NoError: FailNow called
(monitor.go:153).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/c2c/BulkOps/full/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_metamorphicBuild=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-33644

@cockroach-teamcity cockroach-teamcity added branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery labels Nov 19, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Nov 19, 2023
@msbutler (Collaborator)

well this time we did replan:

7.unredacted/cockroach.log:W231119 11:10:51.006357 5963 ccl/streamingccl/streamingest/stream_ingestion_job.go:90 ⋮ [T1,Vsystem,n3,job=‹REPLICATION STREAM INGESTION id=918554480988225540›] 487  waiting before retrying error: node 4 is 18.84 minutes behind the next node. Try replanning: node frontier too far behind other nodes

I'll need to figure out if we then redistributed the catchup scans.

@msbutler (Collaborator) commented Nov 20, 2023

Rats, even after redistribution, we never caught up:
[screenshot]

@msbutler msbutler added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Nov 20, 2023
@msbutler (Collaborator)

Huh, well actually, Grafana fooled me by cropping the timeseries. It seems like node 4 should have replanned towards the end of the test, but never did:
[screenshot]

@msbutler (Collaborator)

Ohhh, we never replanned because of this code snippet:

// Don't check for lagging nodes if the hwm has yet to advance.
if sf.replicatedTimeAtStart.Equal(sf.persistedReplicatedTime) {
	log.VEventf(ctx, 2, "skipping lag replanning check: hwm has yet to advance past %s", sf.replicatedTimeAtStart)
	return nil
}

@cockroach-teamcity (Member, Author)

roachtest.c2c/BulkOps/full failed with artifacts on release-23.2 @ f47ba2eda81179ea16b868df55986614839d17e9:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:749
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:978
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1268
	            				main/pkg/cmd/roachtest/monitor.go:119
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				src/runtime/asm_amd64.s:1598
	Error:      	Received unexpected error:
	            	expected job status succeeded, but got running
	            	(1) attached stack trace
	            	  -- stack trace:
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream.func1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:745
	            	  | github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration
	            	  | 	github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:213
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:727
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).main
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:978
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClusterToCluster.func1.1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1268
	            	  | main.(*monitorImpl).Go.func1
	            	  | 	main/pkg/cmd/roachtest/monitor.go:119
	            	  | golang.org/x/sync/errgroup.(*Group).Go.func1
	            	  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            	  | runtime.goexit
	            	  | 	src/runtime/asm_amd64.s:1598
	            	Wraps: (2) expected job status succeeded, but got running
	            	Error types: (1) *withstack.withStack (2) *errutil.leafError
	Test:       	c2c/BulkOps/full
(require.go:1360).NoError: FailNow called
(monitor.go:153).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/c2c/BulkOps/full/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_metamorphicBuild=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@msbutler (Collaborator)

Latest failure looks like we triggered a catchup scan after everything was caught up:
[screenshot]

craig bot pushed a commit that referenced this issue Nov 27, 2023
streamingccl: only return lag replanning error if lagging node has not advanced

Previously, the frontier processor would return a lag replanning error if it
detected a lagging node and the hwm had advanced during the flow. This meant
the frontier processor could replan as soon as a lagging node finished its
catchup scan and bumped the hwm but was still far behind the other nodes,
as we observed in #114706. Ideally, the frontier processor should not throw
this replanning error because the lagging node is making progress and because
replanning can cause repeated work.

This patch prevents this scenario by teaching the frontier processor to only
throw a replanning error if:
- the hwm has advanced in the flow
- two consecutive lagging node checks detected a lagging node and the hwm has
  not advanced during those two checks.

Informs #114706

Release note: none
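
For reference, a minimal Go sketch of the replanning gate described in the commit message above; the lagGate type, its field names, and the use of time.Time in place of HLC timestamps are illustrative assumptions rather than the actual frontier processor code:

package main

import (
	"errors"
	"fmt"
	"time"
)

// lagGate is an illustrative stand-in for the frontier processor state.
type lagGate struct {
	replicatedTimeAtStart   time.Time // hwm when the current flow started
	persistedReplicatedTime time.Time // latest persisted hwm
	sawLaggingNodeLastCheck bool      // did the previous check see a lagging node?
	hwmAtLastLagCheck       time.Time // hwm observed at that previous check
}

// maybeReplanError returns a replanning error only if the hwm has advanced
// during the flow and two consecutive checks saw a lagging node without the
// hwm advancing between them.
func (g *lagGate) maybeReplanError(laggingNodeDetected bool) error {
	if g.replicatedTimeAtStart.Equal(g.persistedReplicatedTime) {
		// Don't check for lagging nodes if the hwm has yet to advance.
		return nil
	}
	if !laggingNodeDetected {
		g.sawLaggingNodeLastCheck = false
		return nil
	}
	if g.sawLaggingNodeLastCheck && g.persistedReplicatedTime.Equal(g.hwmAtLastLagCheck) {
		return errors.New("node frontier too far behind other nodes; try replanning")
	}
	// First sighting: remember the lagging node and the hwm it was seen at.
	g.sawLaggingNodeLastCheck = true
	g.hwmAtLastLagCheck = g.persistedReplicatedTime
	return nil
}

func main() {
	start := time.Now()
	g := &lagGate{
		replicatedTimeAtStart:   start,
		persistedReplicatedTime: start.Add(time.Minute), // hwm has advanced during the flow
	}
	fmt.Println(g.maybeReplanError(true)) // <nil>: first lagging-node sighting
	fmt.Println(g.maybeReplanError(true)) // error: second sighting, hwm unchanged
}

The point of the two-check rule is that a lagging node whose catchup scan is still bumping the hwm keeps resetting the gate, so only a node that stalls between checks triggers a replan.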
craig bot pushed a commit that referenced this issue Nov 27, 2023
114861: server: remove admin checks from server-v2 r=rafiss a=rafiss

These are now replaced by checks for the VIEWCLUSTERMETADATA privilege. The equivalent change was already made on the v1 server.

The following endpoints are affected (these endpoints are not used in the frontend, so this change does not have any user-facing impact):
- /sessions/
- /nodes/
- /nodes/{node_id}/ranges/
- /ranges/{range_id:[0-9]+}/
- /events/

informs #114384
informs #79571
informs #109814
Release note: None

115000: streamingccl: only return lag replanning error if lagging node has not advanced r=stevendanna a=msbutler

Previously, the frontier processor would return a lag replanning error if it detected a lagging node and the hwm had advanced during the flow. This meant the frontier processor could replan as soon as a lagging node finished its catchup scan and bumped the hwm but was still far behind the other nodes, as we observed in #114706. Ideally, the frontier processor should not throw this replanning error because the lagging node is making progress and because replanning can cause repeated work.

This patch prevents this scenario by teaching the frontier processor to only throw a replanning error if:
- the hwm has advanced in the flow
- two consecutive lagging node checks detected a lagging node and the hwm has not advanced during those two checks.

Informs #114706

Release note: none

115046: catalog/lease: fix flake for TestTableCreationPushesTxnsInRecentPast r=fqazi a=fqazi

Previously, TestTableCreationPushesTxnsInRecentPast could flake because we attempted to increase the odds of hitting the uncertainty interval error by adding a delay on KV RPC calls. This wasn't effective and could still cause intermittent failures, so instead we directly modify the uncertainty interval by setting a large MaxOffset on the clock. This causes the desired behaviour in a more deterministic way.

Fixes: #114366

Release note: None

Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Nov 27, 2023
streamingccl: only return lag replanning error if lagging node has not advanced

Previously, the frontier processor would return a lag replanning error if it
detected a lagging node and the hwm had advanced during the flow. This meant
the frontier processor could replan as soon as a lagging node finished its
catchup scan and bumped the hwm but was still far behind the other nodes,
as we observed in #114706. Ideally, the frontier processor should not throw
this replanning error because the lagging node is making progress and because
replanning can cause repeated work.

This patch prevents this scenario by teaching the frontier processor to only
throw a replanning error if:
- the hwm has advanced in the flow
- two consecutive lagging node checks detected a lagging node and the hwm has
  not advanced during those two checks.

Informs #114706

Release note: none
msbutler added a commit to msbutler/cockroach that referenced this issue Nov 28, 2023
…tting

This patch adds the default-off
physical_replication.consumer.split_on_job_retry setting, which, when enabled,
issues admin splits over the topology after a job-level retry triggered by
distSQL replanning. This setting may help c2c catch up more quickly after a
replanning event, as it would prevent the destination side from issuing admin
splits during catchup scans.

Informs cockroachdb#114706

Release note: none
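
A rough Go sketch of the behavior this setting gates, assuming a hypothetical splitter interface and a plain bool in place of the actual cluster setting plumbing:

package main

import "fmt"

// splitter is a hypothetical interface for whatever issues admin splits on the
// destination cluster.
type splitter interface {
	AdminSplit(key string) error
}

// maybeSplitOnRetry pre-splits at the topology's span start keys before the
// retried ingestion flow begins, so the destination does not have to issue
// those splits in the middle of its catchup scans.
func maybeSplitOnRetry(splitOnJobRetry bool, s splitter, spanStartKeys []string) error {
	if !splitOnJobRetry { // the setting is default-off
		return nil
	}
	for _, k := range spanStartKeys {
		if err := s.AdminSplit(k); err != nil {
			return err
		}
	}
	return nil
}

// loggingSplitter records the splits it is asked to perform.
type loggingSplitter struct{}

func (loggingSplitter) AdminSplit(key string) error {
	fmt.Println("admin split at", key)
	return nil
}

func main() {
	// With the toggle enabled, every span start key gets an up-front split.
	_ = maybeSplitOnRetry(true, loggingSplitter{}, []string{"span-1-start", "span-2-start"})
}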
@msbutler (Collaborator)

fixed by #115101

@msbutler (Collaborator)

argh, closed this on the wrong backport. we're waiting on #115103 to merge

@msbutler msbutler reopened this Nov 28, 2023
@cockroach-teamcity (Member, Author)

roachtest.c2c/BulkOps/full failed with artifacts on release-23.2 @ 48d5dc3efefacae5cebd23ac81c46f4248eb39f0:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:749
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:978
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1268
	            				main/pkg/cmd/roachtest/monitor.go:119
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				src/runtime/asm_amd64.s:1598
	Error:      	Received unexpected error:
	            	expected job status succeeded, but got running
	            	(1) attached stack trace
	            	  -- stack trace:
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream.func1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:745
	            	  | github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration
	            	  | 	github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:213
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:727
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).main
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:978
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClusterToCluster.func1.1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1268
	            	  | main.(*monitorImpl).Go.func1
	            	  | 	main/pkg/cmd/roachtest/monitor.go:119
	            	  | golang.org/x/sync/errgroup.(*Group).Go.func1
	            	  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            	  | runtime.goexit
	            	  | 	src/runtime/asm_amd64.s:1598
	            	Wraps: (2) expected job status succeeded, but got running
	            	Error types: (1) *withstack.withStack (2) *errutil.leafError
	Test:       	c2c/BulkOps/full
(require.go:1360).NoError: FailNow called
(monitor.go:153).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/c2c/BulkOps/full/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_metamorphicBuild=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@dt dt removed GA-blocker branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 labels Nov 29, 2023
@msbutler (Collaborator)

The latest failure could perhaps benefit from splits on retry:
[screenshot]

msbutler added a commit to msbutler/cockroach that referenced this issue Nov 29, 2023
…tting

This patch adds the default-off
physical_replication.consumer.split_on_job_retry setting, which, when enabled,
issues admin splits over the topology after a job-level retry triggered by
distSQL replanning. This setting may help c2c catch up more quickly after a
replanning event, as it would prevent the destination side from issuing admin
splits during catchup scans.

Informs cockroachdb#114706

Release note: none
msbutler added a commit to msbutler/cockroach that referenced this issue Nov 30, 2023
…tting

This patch adds the default-off
physical_replication.consumer.split_on_job_retry setting, which, when enabled,
issues admin splits over the topology after a job-level retry triggered by
distSQL replanning. This setting may help c2c catch up more quickly after a
replanning event, as it would prevent the destination side from issuing admin
splits during catchup scans.

Informs cockroachdb#114706

Release note: none
@msbutler msbutler added the P-2 Issues/test failures with a fix SLA of 3 months label Dec 1, 2023
@msbutler (Collaborator) commented Dec 1, 2023

closing this in favor of #115415
