
roachtest: c2c/BulkOps/full failed #114706

Closed
cockroach-teamcity opened this issue Nov 19, 2023 · 11 comments
Labels: C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months T-disaster-recovery
Milestone: 23.2

@cockroach-teamcity (Member) commented Nov 19, 2023

roachtest.c2c/BulkOps/full failed with artifacts on release-23.2 @ c9f6f30496fe34af8d17410e5307f359458aaf52:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:749
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:978
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1268
	            				main/pkg/cmd/roachtest/monitor.go:119
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				src/runtime/asm_amd64.s:1598
	Error:      	Received unexpected error:
	            	expected job status succeeded, but got running
	            	(1) attached stack trace
	            	  -- stack trace:
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream.func1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:745
	            	  | github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration
	            	  | 	github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:213
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:727
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).main
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:978
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClusterToCluster.func1.1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1268
	            	  | main.(*monitorImpl).Go.func1
	            	  | 	main/pkg/cmd/roachtest/monitor.go:119
	            	  | golang.org/x/sync/errgroup.(*Group).Go.func1
	            	  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            	  | runtime.goexit
	            	  | 	src/runtime/asm_amd64.s:1598
	            	Wraps: (2) expected job status succeeded, but got running
	            	Error types: (1) *withstack.withStack (2) *errutil.leafError
	Test:       	c2c/BulkOps/full
(require.go:1360).NoError: FailNow called
(monitor.go:153).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/c2c/BulkOps/full/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_metamorphicBuild=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-33644

@cockroach-teamcity cockroach-teamcity added branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery labels Nov 19, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Nov 19, 2023
@msbutler (Collaborator)

well this time we did replan:

7.unredacted/cockroach.log:W231119 11:10:51.006357 5963 ccl/streamingccl/streamingest/stream_ingestion_job.go:90 ⋮ [T1,Vsystem,n3,job=‹REPLICATION STREAM INGESTION id=918554480988225540›] 487  waiting before retrying error: node 4 is 18.84 minutes behind the next node. Try replanning: node frontier too far behind other nodes

I'll need to figure out if we then redistributed the catchup scans.

@msbutler (Collaborator) commented Nov 20, 2023

Rats, even after redistribution, we never caught up:
[screenshot]

@msbutler msbutler added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Nov 20, 2023
@msbutler (Collaborator)

Huh, well actually, Grafana fooled me by cropping the timeseries. It seems like node 4 should have replanned towards the end of the test, but never did:
[screenshot]

@msbutler (Collaborator)

Ohhh, we never replanned because of this code snippet:

// Don't check for lagging nodes if the hwm has yet to advance.
if sf.replicatedTimeAtStart.Equal(sf.persistedReplicatedTime) {
	log.VEventf(ctx, 2, "skipping lag replanning check: hwm has yet to advance past %s", sf.replicatedTimeAtStart)
	return nil
}

@cockroach-teamcity (Member, Author)

roachtest.c2c/BulkOps/full failed with artifacts on release-23.2 @ f47ba2eda81179ea16b868df55986614839d17e9:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:749
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:978
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1268
	            				main/pkg/cmd/roachtest/monitor.go:119
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				src/runtime/asm_amd64.s:1598
	Error:      	Received unexpected error:
	            	expected job status succeeded, but got running
	            	(1) attached stack trace
	            	  -- stack trace:
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream.func1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:745
	            	  | github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration
	            	  | 	github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:213
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:727
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).main
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:978
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClusterToCluster.func1.1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1268
	            	  | main.(*monitorImpl).Go.func1
	            	  | 	main/pkg/cmd/roachtest/monitor.go:119
	            	  | golang.org/x/sync/errgroup.(*Group).Go.func1
	            	  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            	  | runtime.goexit
	            	  | 	src/runtime/asm_amd64.s:1598
	            	Wraps: (2) expected job status succeeded, but got running
	            	Error types: (1) *withstack.withStack (2) *errutil.leafError
	Test:       	c2c/BulkOps/full
(require.go:1360).NoError: FailNow called
(monitor.go:153).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/c2c/BulkOps/full/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_metamorphicBuild=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@msbutler (Collaborator)

Latest failure looks like we triggered a catchup scan after everything was caught up:
[screenshot]

craig bot pushed a commit that referenced this issue Nov 27, 2023
streamingccl: only return lag replanning error if lagging node has not advanced

Previously, the frontier processor would return a lag replanning error if it
detected a lagging node and the hwm had advanced during the flow. This meant
the frontier processor could replan as soon as a lagging node finished its
catchup scan and bumped the hwm but was still far behind the other nodes,
as we observed in #114706. Ideally, the frontier processor should not throw
this replanning error because the lagging node is making progress and because
replanning can cause repeated work.

This patch prevents this scenario by teaching the frontier processor to only
throw a replanning error if:
- the hwm has advanced in the flow
- two consecutive lagging node checks detected a lagging node and the hwm has
  not advanced during those two checks.

Informs #114706

Release note: none
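
For reference, a minimal Go sketch of the replanning gate described in the commit message above; the lagGate type, its field names, and the use of time.Time in place of HLC timestamps are illustrative assumptions rather than the actual frontier processor code:

package main

import (
	"errors"
	"fmt"
	"time"
)

// lagGate is an illustrative stand-in for the frontier processor state.
type lagGate struct {
	replicatedTimeAtStart   time.Time // hwm when the current flow started
	persistedReplicatedTime time.Time // latest persisted hwm
	sawLaggingNodeLastCheck bool      // did the previous check see a lagging node?
	hwmAtLastLagCheck       time.Time // hwm observed at that previous check
}

// maybeReplanError returns a replanning error only if the hwm has advanced
// during the flow and two consecutive checks saw a lagging node without the
// hwm advancing between them.
func (g *lagGate) maybeReplanError(laggingNodeDetected bool) error {
	if g.replicatedTimeAtStart.Equal(g.persistedReplicatedTime) {
		// Don't check for lagging nodes if the hwm has yet to advance.
		return nil
	}
	if !laggingNodeDetected {
		g.sawLaggingNodeLastCheck = false
		return nil
	}
	if g.sawLaggingNodeLastCheck && g.persistedReplicatedTime.Equal(g.hwmAtLastLagCheck) {
		return errors.New("node frontier too far behind other nodes; try replanning")
	}
	// First sighting: remember the lagging node and the hwm it was seen at.
	g.sawLaggingNodeLastCheck = true
	g.hwmAtLastLagCheck = g.persistedReplicatedTime
	return nil
}

func main() {
	start := time.Now()
	g := &lagGate{
		replicatedTimeAtStart:   start,
		persistedReplicatedTime: start.Add(time.Minute), // hwm has advanced during the flow
	}
	fmt.Println(g.maybeReplanError(true)) // <nil>: first lagging-node sighting
	fmt.Println(g.maybeReplanError(true)) // error: second sighting, hwm unchanged
}

The point of the two-check rule is that a lagging node whose catchup scan is still bumping the hwm keeps resetting the gate, so only a node that stalls between checks triggers a replan.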
craig bot pushed a commit that referenced this issue Nov 27, 2023
114861: server: remove admin checks from server-v2 r=rafiss a=rafiss

These are now replaced by checks for the VIEWCLUSTERMETADATA privilege. The equivalent change was already made on the v1 server.

The following endpoints are affected (these endpoints are not used in the frontend, so this change does not have any user-facing impact):
- /sessions/
- /nodes/
- /nodes/{node_id}/ranges/
- /ranges/{range_id:[0-9]+}/
- /events/

informs #114384
informs #79571
informs #109814
Release note: None

115000: streamingccl: only return lag replanning error if lagging node has not advanced r=stevendanna a=msbutler

Previously, the frontier processor would return a lag replanning error if it detected a lagging node and the hwm had advanced during the flow. This meant the frontier processor could replan as soon as a lagging node finished its catchup scan and bumped the hwm but was still far behind the other nodes, as we observed in #114706. Ideally, the frontier processor should not throw this replanning error because the lagging node is making progress and because replanning can cause repeated work.

This patch prevents this scenario by teaching the frontier processor to only throw a replanning error if:
- the hwm has advanced in the flow
- two consecutive lagging node checks detected a lagging node and the hwm has not advanced during those two checks.

Informs #114706

Release note: none

115046: catalog/lease: fix flake for TestTableCreationPushesTxnsInRecentPast r=fqazi a=fqazi

Previously, TestTableCreationPushesTxnsInRecentPast could flake because we attempted to increase the odds of hitting the uncertainty interval error by adding a delay on KV RPC calls. This wasn't effective and could still cause intermittent failures, so instead we directly modify the uncertainty interval by setting a large MaxOffset on the clock. This causes the desired behaviour in a more deterministic way.

Fixes: #114366

Release note: None

Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Nov 27, 2023
streamingccl: only return lag replanning error if lagging node has not advanced

Previously, the frontier processor would return a lag replanning error if it
detected a lagging node and the hwm had advanced during the flow. This meant
the frontier processor could replan as soon as a lagging node finished its
catchup scan and bumped the hwm but was still far behind the other nodes,
as we observed in #114706. Ideally, the frontier processor should not throw
this replanning error because the lagging node is making progress and because
replanning can cause repeated work.

This patch prevents this scenario by teaching the frontier processor to only
throw a replanning error if:
- the hwm has advanced in the flow
- two consecutive lagging node checks detected a lagging node and the hwm has
  not advanced during those two checks.

Informs #114706

Release note: none
msbutler added a commit to msbutler/cockroach that referenced this issue Nov 28, 2023
…tting

This patch adds the default-off
physical_replication.consumer.split_on_job_retry setting, which, when enabled,
issues admin splits over the topology after a job-level retry triggered by
distSQL replanning. This setting may help c2c catch up more quickly after a
replanning event, as it would prevent the destination side from issuing admin
splits during catchup scans.

Informs cockroachdb#114706

Release note: none
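
A rough Go sketch of the behavior this setting gates, assuming a hypothetical splitter interface and a plain bool in place of the actual cluster setting plumbing:

package main

import "fmt"

// splitter is a hypothetical interface for whatever issues admin splits on the
// destination cluster.
type splitter interface {
	AdminSplit(key string) error
}

// maybeSplitOnRetry pre-splits at the topology's span start keys before the
// retried ingestion flow begins, so the destination does not have to issue
// those splits in the middle of its catchup scans.
func maybeSplitOnRetry(splitOnJobRetry bool, s splitter, spanStartKeys []string) error {
	if !splitOnJobRetry { // the setting is default-off
		return nil
	}
	for _, k := range spanStartKeys {
		if err := s.AdminSplit(k); err != nil {
			return err
		}
	}
	return nil
}

// loggingSplitter records the splits it is asked to perform.
type loggingSplitter struct{}

func (loggingSplitter) AdminSplit(key string) error {
	fmt.Println("admin split at", key)
	return nil
}

func main() {
	// With the toggle enabled, every span start key gets an up-front split.
	_ = maybeSplitOnRetry(true, loggingSplitter{}, []string{"span-1-start", "span-2-start"})
}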
@msbutler (Collaborator)

fixed by #115101

@msbutler (Collaborator)

argh, closed this on the wrong backport. we're waiting on #115103 to merge

@msbutler msbutler reopened this Nov 28, 2023
@cockroach-teamcity (Member, Author)

roachtest.c2c/BulkOps/full failed with artifacts on release-23.2 @ 48d5dc3efefacae5cebd23ac81c46f4248eb39f0:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:749
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:978
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1268
	            				main/pkg/cmd/roachtest/monitor.go:119
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				src/runtime/asm_amd64.s:1598
	Error:      	Received unexpected error:
	            	expected job status succeeded, but got running
	            	(1) attached stack trace
	            	  -- stack trace:
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream.func1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:745
	            	  | github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration
	            	  | 	github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:213
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:727
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).main
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:978
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClusterToCluster.func1.1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1268
	            	  | main.(*monitorImpl).Go.func1
	            	  | 	main/pkg/cmd/roachtest/monitor.go:119
	            	  | golang.org/x/sync/errgroup.(*Group).Go.func1
	            	  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            	  | runtime.goexit
	            	  | 	src/runtime/asm_amd64.s:1598
	            	Wraps: (2) expected job status succeeded, but got running
	            	Error types: (1) *withstack.withStack (2) *errutil.leafError
	Test:       	c2c/BulkOps/full
(require.go:1360).NoError: FailNow called
(monitor.go:153).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/c2c/BulkOps/full/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_metamorphicBuild=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@dt dt removed GA-blocker branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 labels Nov 29, 2023
@msbutler (Collaborator)

The latest failure could perhaps benefit from splits on retry:
[screenshot]

msbutler added a commit to msbutler/cockroach that referenced this issue Nov 29, 2023
…tting

This patch adds the default-off
physical_replication.consumer.split_on_job_retry setting, which, when enabled,
issues admin splits over the topology after a job-level retry triggered by
distSQL replanning. This setting may help c2c catch up more quickly after a
replanning event, as it would prevent the destination side from issuing admin
splits during catchup scans.

Informs cockroachdb#114706

Release note: none
msbutler added a commit to msbutler/cockroach that referenced this issue Nov 30, 2023
…tting

This patch adds the default-off
physical_replication.consumer.split_on_job_retry setting, which, when enabled,
issues admin splits over the topology after a job-level retry triggered by
distSQL replanning. This setting may help c2c catch up more quickly after a
replanning event, as it would prevent the destination side from issuing admin
splits during catchup scans.

Informs cockroachdb#114706

Release note: none
@msbutler msbutler added the P-2 Issues/test failures with a fix SLA of 3 months label Dec 1, 2023
@msbutler (Collaborator) commented Dec 1, 2023

closing this in favor of #115415
