
roachtest: c2c/BulkOps/full failed #114341

Closed
cockroach-teamcity opened this issue Nov 13, 2023 · 12 comments · Fixed by #114525

@cockroach-teamcity (Member) commented Nov 13, 2023

roachtest.c2c/BulkOps/full failed with artifacts on master @ 063fa2b8930019a16adc0bca5b8363d8e80f3132:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:750
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:971
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1264
	            				main/pkg/cmd/roachtest/monitor.go:119
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				src/runtime/asm_amd64.s:1650
	Error:      	Received unexpected error:
	            	expected job status succeeded, but got running
	            	(1) attached stack trace
	            	  -- stack trace:
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream.func1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:746
	            	  | github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration
	            	  | 	github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:213
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:728
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).main
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:971
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClusterToCluster.func1.1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1264
	            	  | main.(*monitorImpl).Go.func1
	            	  | 	main/pkg/cmd/roachtest/monitor.go:119
	            	  | golang.org/x/sync/errgroup.(*Group).Go.func1
	            	  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            	  | runtime.goexit
	            	  | 	src/runtime/asm_amd64.s:1650
	            	Wraps: (2) expected job status succeeded, but got running
	            	Error types: (1) *withstack.withStack (2) *errutil.leafError
	Test:       	c2c/BulkOps/full
(require.go:1360).NoError: FailNow called
(monitor.go:153).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/c2c/BulkOps/full/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_metamorphicBuild=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-33470

cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery labels Nov 13, 2023
cockroach-teamcity added this to the 24.1 milestone Nov 13, 2023
@msbutler (Collaborator) commented:

Cutover failed on a timeout because the replication lag never caught up during the test. I wonder why the auto replanner didn't help.

11:15:39 latency_verifier.go:139: could not record value 38m4.650553667s: value 2284650553667 is too large to be recorded
11:15:40 latency_verifier.go:139: could not record value 38m5.690941624s: value 2285690941624 is too large to be recorded
11:15:41 latency_verifier.go:139: could not record value 38m6.730999152s: value 2286730999152 is too large to be recorded
11:15:42 latency_verifier.go:139: could not record value 38m7.771010502s: value 2287771010502 is too large to be recorded
11:15:43 cluster_to_cluster.go:906: workload successfully finished
11:15:43 cluster_to_cluster.go:944: workload finished on its own
11:15:43 cluster_to_cluster.go:962: test status: cutover time chosen: 2023-11-13 11:10:43.563407 +0000 UTC
11:15:43 cluster_to_cluster.go:648: LogicalMegabytes: 36181.85
11:15:43 cluster_to_cluster.go:648: PhysicalMegabytes: 21536.65
11:15:43 cluster_to_cluster.go:648: PhysicalReplicatedMegabytes: 56270.69
11:15:43 cluster_to_cluster.go:970: test status: waiting for replication stream to cutover to 2023-11-13 11:10:43.563407 +0000 UTC
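
Aside: the "value ... is too large to be recorded" lines come from the latency verifier's HDR histogram rejecting samples above its maximum trackable value, and a ~38m replication lag exceeds that bound. A minimal sketch of that behavior, assuming the HdrHistogram Go port (github.com/HdrHistogram/hdrhistogram-go); the 100s ceiling below is purely illustrative, not the verifier's actual configuration:

package main

import (
	"fmt"
	"time"

	hdrhistogram "github.com/HdrHistogram/hdrhistogram-go"
)

func main() {
	// Hypothetical bounds: track latencies from 1ms up to 100s with one
	// significant figure. The roachtest verifier configures its own maximum.
	h := hdrhistogram.New(time.Millisecond.Nanoseconds(), (100 * time.Second).Nanoseconds(), 1)

	// A 38m lag, in nanoseconds, exceeds the maximum trackable value, so
	// RecordValue returns an error instead of recording the sample.
	lag := 38 * time.Minute
	if err := h.RecordValue(lag.Nanoseconds()); err != nil {
		fmt.Printf("could not record value %s: %v\n", lag, err)
	}
}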

@msbutler (Collaborator) commented:

For some reason, node lag replanning was never executed:

❯ grep "replan" *.unredacted/cockroach.log | grep -v "import"
5.unredacted/cockroach.log:I231113 10:34:35.627903 5528 util/log/event_log.go:32 ⋮ [T1,Vsystem,n1,client=34.150.235.219:56372,hostssl,user=root] 216 ={"Timestamp":1699871675624390486,"EventType":"set_cluster_setting","Statement":"SET CLUSTER SETTING \"stream_replication.replan_flow_threshold\" = ‹0.1›","Tag":"SET CLUSTER SETTING","User":"root","SettingName":"physical_replication.consumer.replan_flow_threshold","Value":"‹0.1›"}
5.unredacted/cockroach.log:I231113 10:34:35.644254 5528 util/log/event_log.go:32 ⋮ [T1,Vsystem,n1,client=34.150.235.219:56372,hostssl,user=root] 217 ={"Timestamp":1699871675640079685,"EventType":"set_cluster_setting","Statement":"SET CLUSTER SETTING \"physical_replication.consumer.node_lag_replanning_threshold\" = ‹'10m'›","Tag":"SET CLUSTER SETTING","User":"root","SettingName":"physical_replication.consumer.node_lag_replanning_threshold","Value":"‹00:10:00›"}

If it had been executed, we would have seen the log line "node %d is %.2f minutes behind the next node. Try replanning".

So, either all nodes were equally far behind (i.e. no node was more than 10 minutes behind every other node), or there's some bug in our replan policy.
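
For reference, a rough sketch (not the actual CockroachDB implementation) of the comparison that log line implies: replanning only triggers when the slowest node's frontier trails the next-slowest node's frontier by more than the configured lag threshold. The function and its arguments here are hypothetical:

package replansketch

import "time"

// findLaggingNode returns the node whose frontier trails the next-slowest
// node's frontier by more than maxLag (10 minutes in this test), if any.
func findLaggingNode(frontierByNode map[int]time.Time, maxLag time.Duration) (laggard int, ok bool) {
	if len(frontierByNode) < 2 {
		return 0, false
	}
	// Find the node with the oldest (most lagging) frontier timestamp.
	first := true
	var slowestTS time.Time
	for id, ts := range frontierByNode {
		if first || ts.Before(slowestTS) {
			laggard, slowestTS, first = id, ts, false
		}
	}
	// Find the oldest frontier among the remaining nodes: "the next node".
	first = true
	var nextTS time.Time
	for id, ts := range frontierByNode {
		if id == laggard {
			continue
		}
		if first || ts.Before(nextTS) {
			nextTS, first = ts, false
		}
	}
	// If all nodes are roughly equally far behind, this gap stays small and no
	// replanning happens, which is one of the two possibilities above.
	return laggard, nextTS.Sub(slowestTS) > maxLag
}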

@msbutler (Collaborator) commented:

One node has a huge backlog of catchup scans, which means that one node is indeed lagging behind others:
[image: metrics graph showing a large catchup scan backlog concentrated on one node]

@msbutler (Collaborator) commented:

@adityamaru One node (node 2) is lagging behind on catchup scans here, which suggests that its hwm (high-water mark) should be far behind all the other nodes. To confirm, it would be great to see the replication_frontier.txt, but it's not in the debug zip or artifacts. If I enabled periodic collection of the replication frontier in all c2c roachtests, would the debug zip machinery collect it?

msbutler added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Nov 13, 2023
@cockroach-teamcity (Member Author) commented:

roachtest.c2c/BulkOps/full failed with artifacts on master @ 7f96bac8ce44a128b1f98f5462369a9910c802e5:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:750
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:971
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1264
	            				main/pkg/cmd/roachtest/monitor.go:119
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				src/runtime/asm_amd64.s:1650
	Error:      	Received unexpected error:
	            	expected job status succeeded, but got running
	            	(1) attached stack trace
	            	  -- stack trace:
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream.func1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:746
	            	  | github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration
	            	  | 	github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:213
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:728
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).main
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:971
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClusterToCluster.func1.1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1264
	            	  | main.(*monitorImpl).Go.func1
	            	  | 	main/pkg/cmd/roachtest/monitor.go:119
	            	  | golang.org/x/sync/errgroup.(*Group).Go.func1
	            	  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            	  | runtime.goexit
	            	  | 	src/runtime/asm_amd64.s:1650
	            	Wraps: (2) expected job status succeeded, but got running
	            	Error types: (1) *withstack.withStack (2) *errutil.leafError
	Test:       	c2c/BulkOps/full
(require.go:1360).NoError: FailNow called
(monitor.go:153).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/c2c/BulkOps/full/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_metamorphicBuild=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

msbutler self-assigned this Nov 15, 2023
@msbutler (Collaborator) commented:

I was able to repro the failure. For some reason, the replication-frontier.txt file is empty...

@adityamaru (Contributor) commented:

> For some reason, the replication-frontier.txt file is empty

Was the cluster setting to dump the frontier set to something other than 0?

@msbutler (Collaborator) commented:

oh wait

@msbutler (Collaborator) commented:

ok, the replication frontier looks fine. We should replan on node 2's lag, but it doesn't:

Src Instance Dest Instance Span                                                         Frontier Timestamp                      Behind By
1            2             /Tenant/2/Table/106/1/15{1014753-2029505}                    2023-11-15 15:40:29.57541861 +0000 UTC  27m10s
1            2             /Tenant/2/Table/106/1/1{69711297-85943911}                   2023-11-15 15:40:27.969296666 +0000 UTC 27m12s
1            2             /Tenant/2/Table/106/1/22{5000001-7029411}                    2023-11-15 15:40:27.774149789 +0000 UTC 27m12s
1            2             /Tenant/2/Table/106/1/3{78042979-94709447}                   2023-11-15 15:31:31.926283872 +0000 UTC 36m8s
1            2             /Tenant/2/Table/106/1/45{2029029-3043555}                    2023-11-15 15:40:29.375254051 +0000 UTC 27m10s
1            2             /Tenant/2/Table/106/1/4{27172420-43404961}                   2023-11-15 15:31:31.926283872 +0000 UTC 36m8s
1            2             /Tenant/2/Table/106/1/4{69711525-85942823}                   2023-11-15 15:32:09.142240862 +0000 UTC 35m30s
1            2             /Tenant/2/Table/106/1/7{7028741-8043111}                     2023-11-15 15:32:45.181294853 +0000 UTC 34m54s
1            2             /Tenant/2/Table/106/1/{1023174-3069568}                      2023-11-15 15:40:27.969296666 +0000 UTC 27m12s
1            2             /Tenant/2/Table/106/1/{19854022-36089410}                    2023-11-15 15:40:28.118166001 +0000 UTC 27m11s
1            2             /Tenant/2/Table/106/1/{3069568-19854022}                     2023-11-15 15:40:28.064543615 +0000 UTC 27m12s
1            2             /Tenant/2/Table/106/1/{52321121-68552704}                    2023-11-15 15:40:28.064543615 +0000 UTC 27m12s
1            2             /Tenant/2/Table/106/1/{68552704-75000001}                    2023-11-15 15:32:45.181294853 +0000 UTC 34m54s
1            2             /Tenant/2/Table/106/1/{94709889-143402727}                   2023-11-15 15:32:45.181294853 +0000 UTC 34m54s
1            2             /Tenant/2/Table/106/2/1{1308045/379643975-2014656/321960419} 2023-11-15 15:32:39.881947565 +0000 UTC 35m0s
1            2             /Tenant/2/Table/106/2/{4949782/427552132-5657155/164005090}  2023-11-15 15:32:39.881947565 +0000 UTC 35m0s
1            2             /Tenant/2/Table/106/2/{7070599/373073184-9895573/142776448}  2023-11-15 15:32:39.881947565 +0000 UTC 35m0s
1            2             /Tenant/2/Table/106/2/{718696/49504160-2828092/178715332}    2023-11-15 15:32:39.881947565 +0000 UTC 35m0s
1            2             /Tenant/2/Table/106/3/8{486/322776101-844/513185414}         2023-11-15 15:32:39.881947565 +0000 UTC 35m0s
1            2             /Tenant/2/Table/106/3/9{323/155295265-805/186177284}         2023-11-15 15:32:39.881947565 +0000 UTC 35m0s
1            2             /Tenant/2/Table/106/{1/589347074-2/1/27843492}               2023-11-15 15:32:39.881947565 +0000 UTC 35m0s
1            2             /Tenant/2/Table/106/{2/14867686/564436069-3/8371/181802916}  2023-11-15 15:32:39.881947565 +0000 UTC 35m0s
2            3             /Tenant/2/Table/106/1/1{53044257-69711297}                   2023-11-15 16:07:21.948562462 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/277{172512-719973}                     2023-11-15 16:07:20.765557784 +0000 UTC 18.8s
2            3             /Tenant/2/Table/106/1/2{02172357-18404001}                   2023-11-15 16:07:21.967109762 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/2{18404001-25000001}                   2023-11-15 16:07:21.948562462 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/2{77719973-92825318}                   2023-11-15 16:07:21.972875964 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/30{1014500-2028999}                    2023-11-15 16:07:21.967109762 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/30{2028999-3043522}                    2023-11-15 16:07:21.972875964 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/37{7028645-8042979}                    2023-11-15 16:07:21.967109762 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/3{35943207-52174211}                   2023-11-15 16:07:21.97978201 +0000 UTC  17.6s
2            3             /Tenant/2/Table/106/1/4{43404961-50000001}                   2023-11-15 16:07:21.948562462 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/52{5000001-6014503}                    2023-11-15 16:07:21.97978201 +0000 UTC  17.6s
2            3             /Tenant/2/Table/106/1/54{0653637-1203073}                    2023-11-15 16:07:23.781526617 +0000 UTC 15.8s
2            3             /Tenant/2/Table/106/1/54{1203073-8462787}                    2023-11-15 16:07:21.972875964 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/556{304519-884517}                     2023-11-15 16:07:21.948562462 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/5{18405057-25000001}                   2023-11-15 16:07:21.948562462 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/5{48462787-56304519}                   2023-11-15 16:07:21.967109762 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/5{56884517-73115557}                   2023-11-15 16:07:21.967109762 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/7{6014371-7028741}                     2023-11-15 16:07:21.948562462 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/{292825318-301014500}                  2023-11-15 16:07:21.948562462 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/2/12{014656/321960419-720244/578910049}  2023-11-15 16:07:21.772673264 +0000 UTC 17.8s
2            3             /Tenant/2/Table/106/2/1{2720244/578910049-4160751/43724224}  2023-11-15 16:07:21.972875964 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/2/{5657155/164005090-6362552/502968865}  2023-11-15 16:07:21.572075397 +0000 UTC 18s
2            3             /Tenant/2/Table/106/3/8{371/181802916-486/322776101}         2023-11-15 16:07:21.948562462 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/3/9{805/186177284-924/571393728}         2023-11-15 16:07:21.967109762 +0000 UTC 17.6s
2            3             /Tenant/{2/Table/106/3/10413/13894371-3}                     2023-11-15 16:07:21.967109762 +0000 UTC 17.6s
3            1             /Tenant/2/Table/106/1/15{2029505-3044257}                    2023-11-15 16:07:24.981910965 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/1{-023174}                             2023-11-15 16:07:24.981910965 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/1{43402727-50000001}                   2023-11-15 16:07:24.982576296 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/260{361987-941701}                     2023-11-15 16:07:24.981910965 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/2{28044128-30798211}                   2023-11-15 16:07:24.957201003 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/2{30798211-44710023}                   2023-11-15 16:07:25.157539582 +0000 UTC 14.4s
3            1             /Tenant/2/Table/106/1/2{44710023-52519300}                   2023-11-15 16:07:24.981910965 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/2{52519300-60361987}                   2023-11-15 16:07:24.973114365 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/37{6014311-7028645}                    2023-11-15 16:07:24.982576296 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/3{52174211-68405217}                   2023-11-15 16:07:24.981910965 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/41{0940646-1471557}                    2023-11-15 16:07:27.985031386 +0000 UTC 11.6s
3            1             /Tenant/2/Table/106/1/45{1014503-2029029}                    2023-11-15 16:07:24.982576296 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/4{11471557-27172420}                   2023-11-15 16:07:24.973114365 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/52{6014503-7029029}                    2023-11-15 16:07:24.973114365 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/52{7029029-8043555}                    2023-11-15 16:07:24.981910965 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/5{28043555-39494084}                   2023-11-15 16:07:24.973114365 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/5{39494084-40653637}                   2023-11-15 16:07:24.982576296 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/{185943911-202172357}                  2023-11-15 16:07:24.981910965 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/2/4{239136/539234624-949782/427552132}   2023-11-15 16:07:27.988565601 +0000 UTC 11.6s
3            1             /Tenant/2/Table/106/2/{1/27843492-718696/49504160}           2023-11-15 16:07:27.778162906 +0000 UTC 11.8s
3            1             /Tenant/2/Table/106/2/{2828092/178715332-3534896/129169536}  2023-11-15 16:07:28.018700172 +0000 UTC 11.6s
3            1             /Tenant/2/Table/106/2/{3534896/129169536-4239136/539234624}  2023-11-15 16:07:24.981910965 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/3/104{04/84903972-13/13894371}           2023-11-15 16:07:25.173432415 +0000 UTC 14.4s
3            1             /Tenant/2/Table/106/3/10{164/174863682-404/84903972}         2023-11-15 16:07:24.957201003 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/3/8{844/513185414-963/576430853}         2023-11-15 16:07:24.957201003 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/3/9{203/417237639-323/155295265}         2023-11-15 16:07:24.981910965 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106{-/1/1}                                   2023-11-15 16:07:24.957201003 +0000 UTC 14.6s
3            1             /Tenant/2{-/Table/106}                                       2023-11-15 16:07:27.951796081 +0000 UTC 11.6s
4            4             /Tenant/2/Table/106/1/15{0000001-1014753}                    2023-11-15 16:07:29.772458998 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/22{7029411-8044128}                    2023-11-15 16:07:29.785138662 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/26{0941701-1503584}                    2023-11-15 16:07:29.79323158 +0000 UTC  9.8s
4            4             /Tenant/2/Table/106/1/2{61503584-77172512}                   2023-11-15 16:07:29.785138662 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/37601431{0-1}                          2023-11-15 16:07:29.772458998 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/37{5000001-6014310}                    2023-11-15 16:07:29.786431561 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/3{03043522-19710177}                   2023-11-15 16:07:29.785138662 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/3{19710177-35943207}                   2023-11-15 16:07:29.79323158 +0000 UTC  9.8s
4            4             /Tenant/2/Table/106/1/3{68405217-75000001}                   2023-11-15 16:07:29.79323158 +0000 UTC  9.8s
4            4             /Tenant/2/Table/106/1/410{360676-940646}                     2023-11-15 16:07:29.786431561 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/45{0000001-1014503}                    2023-11-15 16:07:29.786431561 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/4{02518566-10360676}                   2023-11-15 16:07:29.785138662 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/4{53043555-69711525}                   2023-11-15 16:07:29.786431561 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/58{0923716-8767557}                    2023-11-15 16:07:29.79323158 +0000 UTC  9.8s
4            4             /Tenant/2/Table/106/1/58{8767557-9347074}                    2023-11-15 16:07:29.772458998 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/5{02173190-18405057}                   2023-11-15 16:07:29.785138662 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/5{73115557-80923716}                   2023-11-15 16:07:29.785138662 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/7{5000001-6014371}                     2023-11-15 16:07:29.772458998 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/{36089410-52321121}                    2023-11-15 16:07:29.772458998 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/{394709447-402518566}                  2023-11-15 16:07:29.79323158 +0000 UTC  9.8s
4            4             /Tenant/2/Table/106/1/{485942823-502173190}                  2023-11-15 16:07:29.786431561 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/{78043111-94709889}                    2023-11-15 16:07:29.772458998 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/2/14{160751/43724224-867686/564436069}   2023-11-15 16:07:29.785138662 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/2/1{0602851/135163616-1308045/379643975} 2023-11-15 16:07:29.785138662 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/2/{6362552/502968865-7070599/373073184}  2023-11-15 16:07:27.966339919 +0000 UTC 11.6s
4            4             /Tenant/2/Table/106/2/{9895573/142776448-10602851/135163616} 2023-11-15 16:07:29.97979045 +0000 UTC  9.6s
4            4             /Tenant/2/Table/106/3/9{083/414592929-203/417237639}         2023-11-15 16:07:29.785138662 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/3/{8963/576430853-9083/414592929}        2023-11-15 16:07:29.772458998 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/3/{9924/571393728-10164/174863682}       2023-11-15 16:07:29.79323158 +0000 UTC  9.8s

@msbutler (Collaborator) commented Nov 15, 2023

aha, I think I figured out the problem after rerunning the test with some extra logging: when we consider checking for lagging nodes, we always update sf.lastNodeLagCheck, even if we exit early from the check because not enough time has passed since the last check (timeutil.Since(sf.lastNodeLagCheck) < checkFreq). If we call maybeCheckForLaggingNodes frequently enough, we never get past that early return, so we never actually consider replanning. This should be a quick fix.

func (sf *streamIngestionFrontier) maybeCheckForLaggingNodes() error {
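	// Note: this deferred assignment runs on every call, including the early
	// return below, so lastNodeLagCheck keeps advancing even when the actual
	// lag check is skipped.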
	defer func() {
		sf.lastNodeLagCheck = timeutil.Now()
	}()
	log.Infof(sf.Ctx(), "Consider replanning check")
	checkFreq := streamingccl.ReplanFrequency.Get(&sf.FlowCtx.Cfg.Settings.SV)
	maxLag := streamingccl.InterNodeLag.Get(&sf.FlowCtx.Cfg.Settings.SV)
	if checkFreq == 0 || maxLag == 0 || timeutil.Since(sf.lastNodeLagCheck) < checkFreq {
		log.Infof(sf.Ctx(), "Don't replan: maxLag %d; checkFreq %.2f; last node check %s; time since last check %.2f",
			maxLag, checkFreq.Minutes(), sf.lastNodeLagCheck, timeutil.Since(sf.lastNodeLagCheck).Minutes())
		return nil
	}

msbutler added a commit to msbutler/cockroach that referenced this issue Nov 15, 2023
This patch prevents the lastNodeLagCheck time from updating every time the
frontier processor receives a checkpoint, which can happen every few seconds.
This previously prevented the node lag replanning check from triggering because this
time needed to be older than 10 minutes. Rather, this timestamp should only
update if we actually compute the lag check.

Fixes cockroachdb#114341

Release note: none
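
A minimal sketch of the approach the patch describes, reusing the fields and settings from the snippet above (this illustrates the idea, not the exact change): advance lastNodeLagCheck only after the early-return guard passes, so frequent skipped calls no longer reset the clock.

func (sf *streamIngestionFrontier) maybeCheckForLaggingNodes() error {
	checkFreq := streamingccl.ReplanFrequency.Get(&sf.FlowCtx.Cfg.Settings.SV)
	maxLag := streamingccl.InterNodeLag.Get(&sf.FlowCtx.Cfg.Settings.SV)
	// Bail out without touching lastNodeLagCheck when replanning is disabled
	// or not enough time has passed since the last real check.
	if checkFreq == 0 || maxLag == 0 || timeutil.Since(sf.lastNodeLagCheck) < checkFreq {
		return nil
	}
	// Only a check that actually runs advances the timestamp.
	defer func() {
		sf.lastNodeLagCheck = timeutil.Now()
	}()
	// ... compute per-node lag and replan if any node is more than maxLag
	// behind the next node ...
	return nil
}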
@cockroach-teamcity (Member Author) commented:

roachtest.c2c/BulkOps/full failed with artifacts on master @ e19c24fb62d24595e74c0bae0aaad0a736c2bdc7:

(test_runner.go:1141).runTest: test timed out (2h0m0s)
test artifacts and logs in: /artifacts/c2c/BulkOps/full/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_metamorphicBuild=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@cockroach-teamcity (Member Author) commented:

roachtest.c2c/BulkOps/full failed with artifacts on master @ aa812a63b8023d35b7bd62d12c6c47708f48a817:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:750
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:971
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1264
	            				main/pkg/cmd/roachtest/monitor.go:119
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				src/runtime/asm_amd64.s:1650
	Error:      	Received unexpected error:
	            	expected job status succeeded, but got running
	            	(1) attached stack trace
	            	  -- stack trace:
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream.func1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:746
	            	  | github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration
	            	  | 	github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:213
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:728
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).main
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:971
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClusterToCluster.func1.1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1264
	            	  | main.(*monitorImpl).Go.func1
	            	  | 	main/pkg/cmd/roachtest/monitor.go:119
	            	  | golang.org/x/sync/errgroup.(*Group).Go.func1
	            	  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            	  | runtime.goexit
	            	  | 	src/runtime/asm_amd64.s:1650
	            	Wraps: (2) expected job status succeeded, but got running
	            	Error types: (1) *withstack.withStack (2) *errutil.leafError
	Test:       	c2c/BulkOps/full
(require.go:1360).NoError: FailNow called
(monitor.go:153).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/c2c/BulkOps/full/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_metamorphicBuild=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

craig bot pushed a commit that referenced this issue Nov 16, 2023
114525: streamingccl: prevent node lag replanning starvation r=stevendanna a=msbutler

This patch prevents the lastNodeLagCheck time from updating every time the frontier processor receives a checkpoint, which can happen every few seconds. This previously prevented the node lag replanning check from triggering because this time needed to be older than 10 minutes. Rather, this timestamp should only update if we actually compute the lag check.

Fixes #114341

Release note: none

Co-authored-by: Michael Butler <[email protected]>
craig bot closed this as completed in b61cbb9 Nov 16, 2023
blathers-crl bot pushed a commit that referenced this issue Nov 16, 2023
This patch prevents the lastNodeLagCheck time from updating every time the
frontier processor receives a checkpoint, which can happen every few seconds.
This previously prevented the node lag replanning check from triggering because this
time needed to be older than 10 minutes. Rather, this timestamp should only
update if we actually compute the lag check.

Fixes #114341

Release note: none