
roachtest: c2c/tpcc/warehouses=1000/duration=60/cutover=30 failed [stalled rangefeed] #119333

Closed
cockroach-teamcity opened this issue Feb 17, 2024 · 15 comments · Fixed by #119512
Labels
A-kv-rangefeed Rangefeed infrastructure, server+client C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. P-0 Issues/test failures with a fix SLA of 2 weeks release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team

Comments

cockroach-teamcity (Member) commented Feb 17, 2024

roachtest.c2c/tpcc/warehouses=1000/duration=60/cutover=30 failed with artifacts on release-23.2.1-rc @ 898cd6a363fd47bb92a03bac216f9bed0f64bc08:

(latency_verifier.go:199).assertValid: max latency was more than allowed: 17m13.946417082s vs 2m0s
(monitor.go:153).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/c2c/tpcc/warehouses=1000/duration=60/cutover=30/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-36155

@cockroach-teamcity cockroach-teamcity added branch-release-23.2.1-rc C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery labels Feb 17, 2024
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Feb 17, 2024
@dt dt added P-1 Issues/test failures with a fix SLA of 1 month and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Feb 20, 2024
msbutler (Collaborator) commented Feb 21, 2024

Node 3 got stuck, and the replanner did not kick in because the stream had made initial progress since the last check:

5.unredacted/cockroach.log:I240217 11:40:16.310203 44966 ccl/streamingccl/streamingest/stream_ingestion_frontier_processor.go:571 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=944042679120232450›,stream-ingest-distsql,f‹22f5b763›,distsql.gateway=1,distsql.appname=‹$ internal-resume-job-944042679120232450›] 907  detected a lagging node: node 3 is 10.66 minutes behind the next node. Try replanning: node frontier too far behind other nodes. Don't forward error because replicated time at last check 0,0 is less than current replicated time 1708169362.163534125,0

This happened again an hour later on node 4

5.unredacted/cockroach.log:I240217 12:40:16.877005 44966 ccl/streamingccl/streamingest/stream_ingestion_frontier_processor.go:571 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=944042679120232450›,stream-ingest-distsql,f‹22f5b763›,distsql.gateway=1,distsql.appname=‹$ internal-resume-job-944042679120232450›] 1430  detected a lagging node: node 4 is 17.00 minutes behind the next node. Try replanning: node frontier too far behind other nodes. Don't forward error because replicated time at last check 0,0 is less than current replicated time 1708172582.161500341,0
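For readers following along, here is a minimal, self-contained sketch (in Go, with hypothetical names, not the actual frontier-processor code) of the decision the log lines above describe: the replan error is only forwarded when the frontier has made no progress since the last check, so a node that stalls after making initial progress never triggers a replan.

```go
package main

import (
	"fmt"
	"time"
)

// hlcTS is a stand-in for CockroachDB's hlc.Timestamp; only the wall-clock
// component matters for this sketch.
type hlcTS struct{ wall int64 }

func (t hlcTS) Less(o hlcTS) bool { return t.wall < o.wall }

// shouldForwardReplanError is a hypothetical helper mirroring the decision in
// the log line: a lagging node is detected, but the replan error is only
// forwarded (triggering a replan) when the replicated time has not advanced
// since the last check. Because some progress was made, the error was
// swallowed and the stalled rangefeed kept the frontier lagging.
func shouldForwardReplanError(lag, maxLag time.Duration, lastCheck, current hlcTS) bool {
	if lag < maxLag {
		return false // nothing is lagging enough to consider a replan
	}
	// Forward only if replicated time did not advance since the last check.
	return !lastCheck.Less(current)
}

func main() {
	last := hlcTS{wall: 0}                      // "0,0" in the log
	current := hlcTS{wall: 1708169362163534125} // advanced since the last check
	fmt.Println(shouldForwardReplanError(10*time.Minute, 5*time.Minute, last, current))
	// Prints "false": the error is not forwarded, matching the log line.
}
```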

msbutler (Collaborator) commented Feb 21, 2024

Looks like a rangefeed on source node 3 stalled for 17 minutes with this fun new error, which was logged continuously from 12:23 to 12:40 and lines up with the 17-minute delay reported by the lag checker:

~/Downloads/artifacts_slow_c2c/logs/3.unredacted
❯ grep "kv/kvserver/rangefeed/task.go:245" cockroach.log | head -n 1
E240217 12:23:06.192206 11736095 kv/kvserver/rangefeed/task.go:245 ⋮ [T1,Vsystem,n3,s3,r229/4:‹/Tenant/2/Table/109/1/2{6…-8…}›,rangefeed] 1784  pushing old intents failed: range barrier failed, range split: key range /Tenant/2/Table/109/1/‹261›/‹"B\xfb\x85F\xfc=D\x00\x80\x00\x00\x00\x00w\xc6("›-/Tenant/2/Table/109/1/‹284›/‹"H\xf2\x8f\xd4\xf4\xb9L\x00\x80\x00\x00\x00\x00\x82p\xb8"› outside of bounds of range /Tenant/2/Table/109/1/‹261›/‹"B\xfb\x85F\xfc=D\x00\x80\x00\x00\x00\x00w\xc6("›-/Tenant/2/Table/109/1/‹273›/‹"E\xf7\n\x8d\xf8{H\x00\x80\x00\x00\x00\x00}\x1bp"›; suggested ranges: [‹desc: r229:/Tenant/2/Table/109/1/2{61/"B\xfb\x85F\xfc=D\x00\x80\x00\x00\x00\x00w\xc6("-73/"E\xf7\n\x8d\xf8{H\x00\x80\x00\x00\x00\x00}\x1bp"} [(n3,s3):4, (n1,s1):5, (n4,s4):6, next=7, gen=31, sticky=1708172411.570988803,0], lease: <empty>, closed_timestamp_policy: LAG_BY_CLUSTER_SETTING›]

~/Downloads/artifacts_slow_c2c/logs/3.unredacted
❯ grep "kv/kvserver/rangefeed/task.go:245" cockroach.log | tail -n 1
E240217 12:40:23.185617 15466593 kv/kvserver/rangefeed/task.go:245 ⋮ [T1,Vsystem,n3,s3,r229/4:‹/Tenant/2/Table/109/1/2{6…-8…}›,rangefeed] 3063  pushing old intents failed: range barrier failed, range split: key range /Tenant/2/Table/109/1/‹261›/‹"B\xfb\x85F\xfc=D\x00\x80\x00\x00\x00\x00w\xc6("›-/Tenant/2/Table/109/1/‹284›/‹"H\xf2\x8f\xd4\xf4\xb9L\x00\x80\x00\x00\x00\x00\x82p\xb8"› outside of bounds of range /Tenant/2/Table/109/1/‹261›/‹"B\xfb\x85F\xfc=D\x00\x80\x00\x00\x00\x00w\xc6("›-/Tenant/2/Table/109/1/‹273›/‹"E\xf7\n\x8d\xf8{H\x00\x80\x00\x00\x00\x00}\x1bp"›; suggested ranges: [‹desc: r229:/Tenant/2/Table/109/1/2{61/"B\xfb\x85F\xfc=D\x00\x80\x00\x00\x00\x00w\xc6("-73/"E\xf7\n\x8d\xf8{H\x00\x80\x00\x00\x00\x00}\x1bp"} [(n3,s3):4, (n1,s1):5, (n4,s4):6, next=7, gen=31, sticky=1708172411.570988803,0], lease: <empty>, closed_timestamp_policy: LAG_BY_CLUSTER_SETTING›]

@msbutler msbutler added P-0 Issues/test failures with a fix SLA of 2 weeks and removed P-1 Issues/test failures with a fix SLA of 1 month labels Feb 21, 2024
msbutler (Collaborator) commented:

Looks like this error was recently introduced by #118413 by @erikgrinaker. Since this looks like a rangefeed bug, I imagine this could cause a changefeed to stall. Adding a release blocker.

@msbutler msbutler added release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team and removed T-disaster-recovery labels Feb 21, 2024
msbutler (Collaborator) commented Feb 22, 2024

Hrm. While I'm pretty sure this range barrier error caused the c2c stream to stall given the timing, one wrinkle I should note is that the c2c logging indicated that destination node 4 was lagging behind, whereas according to the streaming_ingestion_dist logging, dest node 4 was paired with src node 2:

5.unredacted/cockroach.log:I240217 11:29:35.463107 41017 ccl/streamingccl/streamingest/stream_ingestion_dist.go:744 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=944042679120232450›] 434  physical replication src-dst pair candidate: ‹2› (locality ‹cloud=gce,region=us-east1,zone=us-east1-b›) - 4 (locality ‹cloud=gce,region=us-east1,zone=us-east1-b›)

I suspect the rangefeed in the kv server on node 3 was then passed to dest node 4 via src node 2, but I'm not totally sure.

@stevendanna stevendanna changed the title roachtest: c2c/tpcc/warehouses=1000/duration=60/cutover=30 failed roachtest: c2c/tpcc/warehouses=1000/duration=60/cutover=30 failed [stalled rangefeed] Feb 22, 2024
erikgrinaker (Contributor) commented:

I think this indicates that the range was split, but the split wasn't applied to the follower. I'll have a closer look when I have a chance.

@erikgrinaker erikgrinaker self-assigned this Feb 22, 2024
erikgrinaker (Contributor) commented:

This happened after r231 merged into r229:

I240217 12:23:05.160686 17579550 kv/kvserver/replica_command.go:718 ⋮ [T1,Vsystem,n1,merge,s1,r229/5:‹/Tenant/2/Table/109/1/2{6…-7…}›] 2788
  initiating a merge of r231:‹/Tenant/2/Table/109/1/2{73/"E\xf7\n\x8d\xf8{H\x00\x80\x00\x00\x00\x00}\x1bp"-84/"H\xf2\x8f\xd4\xf4\xb9L\x00\x80\x00\x00\x00\x00\x82p\xb8"}› [(n1,s1):1, (n3,s3):4, (n4,s4):5, next=6, gen=28, sticky=1708172411.570988803,0]
  into this range (‹lhs+rhs size (32 MiB+32 MiB=63 MiB) below threshold (128 MiB) lhs+rhs cpu (4ms+7ms=11ms) below threshold (250ms)›)
I240217 12:23:05.172789 332 kv/kvserver/store_remove_replica.go:157 ⋮ [T1,Vsystem,n3,s3,r229/4:‹/Tenant/2/Table/109/1/2{6…-7…}›,raft] 1783  removing replica r231/4
E240217 12:23:06.192206 11736095 kv/kvserver/rangefeed/task.go:245 ⋮ [T1,Vsystem,n3,s3,r229/4:‹/Tenant/2/Table/109/1/2{6…-8…}›,rangefeed] 1784
  pushing old intents failed: range barrier failed, range split:
  key range /Tenant/2/Table/109/1/‹261›/‹"B\xfb\x85F\xfc=D\x00\x80\x00\x00\x00\x00w\xc6("›-/Tenant/2/Table/109/1/‹284›/‹"H\xf2\x8f\xd4\xf4\xb9L\x00\x80\x00\x00\x00\x00\x82p\xb8"› outside of bounds of range
  /Tenant/2/Table/109/1/‹261›/‹"B\xfb\x85F\xfc=D\x00\x80\x00\x00\x00\x00w\xc6("›-/Tenant/2/Table/109/1/‹273›/‹"E\xf7\n\x8d\xf8{H\x00\x80\x00\x00\x00\x00}\x1bp"›;
  suggested ranges: [‹desc: r229:/Tenant/2/Table/109/1/2{61/"B\xfb\x85F\xfc=D\x00\x80\x00\x00\x00\x00w\xc6("-73/"E\xf7\n\x8d\xf8{H\x00\x80\x00\x00\x00\x00}\x1bp"} [(n3,s3):4, (n1,s1):5, (n4,s4):6, next=7, gen=31, sticky=1708172411.570988803,0], lease: <empty>, closed_timestamp_policy: LAG_BY_CLUSTER_SETTING›]

The barrier is sent with the correct, post-merge bounds, but the server sees the old, pre-merge bounds and rejects the request. Why?
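To make the failure mode concrete, here is a minimal sketch (simplified, hypothetical types, not the actual kvserver code) of the kind of bounds check that rejects the barrier when the request span reflects the post-merge bounds but the check still uses the narrower, pre-merge bounds:

```go
package main

import (
	"bytes"
	"fmt"
)

// span and rangeDesc are simplified stand-ins for roachpb.Span and
// roachpb.RangeDescriptor; this is an illustrative sketch, not the actual
// kvserver request-bounds check.
type span struct{ key, endKey []byte }

type rangeDesc struct{ startKey, endKey []byte }

// containsSpan reports whether a request span fits entirely within the range
// descriptor's bounds. A request addressed with the wide, post-merge bounds
// gets rejected when checked against the narrow, pre-merge bounds, yielding a
// "key range ... outside of bounds of range" error like the one logged above.
func (d rangeDesc) containsSpan(s span) bool {
	return bytes.Compare(s.key, d.startKey) >= 0 &&
		bytes.Compare(s.endKey, d.endKey) <= 0
}

func main() {
	barrierSpan := span{key: []byte("/261"), endKey: []byte("/284")}            // post-merge bounds
	staleBounds := rangeDesc{startKey: []byte("/261"), endKey: []byte("/273")}  // pre-merge bounds
	if !staleBounds.containsSpan(barrierSpan) {
		fmt.Println("range barrier failed, range split: key range outside of bounds of range")
	}
}
```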

erikgrinaker (Contributor) commented:

The barrier request doesn't bypass the lease check or anything, so it should be evaluated on the leaseholder, which should have applied the merge already:

cockroach/pkg/kv/kvpb/api.go

Lines 1776 to 1782 in e494351

func (r *BarrierRequest) flags() flag {
	flags := isWrite | isRange | isAlone
	if r.WithLeaseAppliedIndex {
		flags |= isUnsplittable // the LAI is only valid for a single range
	}
	return flags
}

erikgrinaker (Contributor) commented Feb 22, 2024

Could this be a tenant thing or something, where the bounds aren't propagated to the tenant yet? 🤔 Wouldn't immediately think so, since this runs in KV.

Haven't seen this happen in the wild, just checked CC.

erikgrinaker (Contributor) commented:

All three nodes applied the merge immediately:

I240217 12:23:05.172392 394 kv/kvserver/store_remove_replica.go:157 ⋮ [T1,Vsystem,n1,s1,r229/5:‹/Tenant/2/Table/109/1/2{6…-7…}›,raft] 2789  removing replica r231/1
I240217 12:23:05.172789 332 kv/kvserver/store_remove_replica.go:157 ⋮ [T1,Vsystem,n3,s3,r229/4:‹/Tenant/2/Table/109/1/2{6…-7…}›,raft] 1783  removing replica r231/4
I240217 12:23:05.172585 310 kv/kvserver/store_remove_replica.go:157 ⋮ [T1,Vsystem,n4,s4,r229/6:‹/Tenant/2/Table/109/1/2{6…-7…}›,raft] 1485  removing replica r231/5

erikgrinaker (Contributor) commented Feb 22, 2024

Secondary question: why do we keep trying to push this transaction? Are we not seeing the intent resolution replicated, which would remove it from tracking? Maybe because we don't emit any of the events here when the barrier fails or something. Will look into it after I fix the primary problem.

erikgrinaker (Contributor) commented:

I have no fewer than 3 prototype fixes; we'll need to gauge backport safety.

> Secondary question: why do we keep trying to push this transaction? Are we not seeing the intent resolution replicated, which would remove it from tracking?

I think that's exactly the problem: this probably was an aborted transaction, whose txn record was GCed, but there is nothing actually causing the intents to get GCed. We would do so after sending the barrier, but we error out before that.

I'll submit a PR to attempt to resolve intents even if the barrier fails, which should be a bit more robust.
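As a rough sketch of what "attempt to resolve intents even if the barrier fails" could look like (hypothetical function names, not the actual rangefeed task code; the real PR may differ):

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// The names below (intent, sendBarrier, resolveIntents, pushOldIntents) are
// hypothetical stand-ins used to illustrate the proposed change; they are not
// CockroachDB APIs.
type intent struct{ key string }

var errBarrierFailed = errors.New("range barrier failed")

func sendBarrier(ctx context.Context) error { return errBarrierFailed }

func resolveIntents(ctx context.Context, intents []intent) error {
	for _, in := range intents {
		fmt.Printf("resolving intent at %s\n", in.key)
	}
	return nil
}

// pushOldIntents sketches the proposed behavior: even if the barrier errors
// out (e.g. due to the bounds mismatch in this issue), still attempt to
// resolve the tracked intents so that an aborted transaction's intents don't
// linger and keep the rangefeed's push loop retrying forever.
func pushOldIntents(ctx context.Context, intents []intent) error {
	barrierErr := sendBarrier(ctx)
	// Previously the push would bail out here and the intents stayed tracked.
	if err := resolveIntents(ctx, intents); err != nil {
		return err
	}
	return barrierErr // still surface the barrier failure for visibility
}

func main() {
	err := pushOldIntents(context.Background(), []intent{{key: "/Tenant/2/Table/109/1/261/..."}})
	fmt.Println("barrier error:", err)
}
```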

erikgrinaker (Contributor) commented:

And another candidate fix:

erikgrinaker (Contributor) commented:

After discussing, we're going to go with #119512. We just have to confirm that this will reliably update the range cache first, and add another test.

@craig craig bot closed this as completed in 9d55e4b Feb 22, 2024
@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-rangefeed Rangefeed infrastructure, server+client labels Feb 23, 2024
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024