kv: index backfill never completes with ReplAC apply_to_all #135343

Open
andrewbaptist opened this issue Nov 15, 2024 · 1 comment
Labels
A-replication-admission-control-v2: Related to introduction of replication AC v2
branch-master: Failures and bugs on the master branch.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-perturbation: Bugs found by the perturbation framework
T-kv: KV Team

andrewbaptist commented Nov 15, 2024

Describe the problem

While running the backfill test, the backfill hung and nodes became stuck waiting for a snapshot.

To Reproduce

Run the following test:

PERTURBATION_OVERRIDE=acMode=fullBoth roachtest run perturbation/full/backfill

The test in #135339 also reproduces the issue.

Additional data / screenshots

The error in the logs is:

E241115 22:00:40.609821 23415709 kv/kvserver/queue.go:1198 ⋮ [T1,Vsystem,n3,raftsnapshot,s6,r7801/4:‹/Table/109/1/-781{9715…-7870…}›] 535505  error sending couldn't accept ‹range_id:7801 coordinator_replica:<node_id:3 store_id:6 replica_id:4 type:VOTER_FULL > recipient_replica:<node_id:12 store_id:24 replica_id:1 type:VOTER_FULL > delegated_sender:<node_id:3 store_id:6 replica_id:4 type:VOTER_FULL > term:7 first_index:11993 sender_queue_name:RAFT_SNAPSHOT_QUEUE descriptor_generation:95 queue_on_delegate_len:-1 snap_id:9e4c2549-8a9c-4d99-8d92-99594f668bd8 ›: (n12,s24):1: remote couldn't accept snapshot 9e4c2549 at applied index 11993: ‹snapshot intersects existing range; initiated GC:› [n12,s24,r7924/4:‹/Table/109/1/-78{2340…-1418…}›] (incoming ‹/Table/109/1/-781{9715178531312532-7870688572937416}›)

This error repeats at a high rate (~100/s).
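One way to confirm a repeat rate like this from a node's log file is to bucket the matching lines by second. The following is a minimal sketch, not part of the original report: the function name `errors_per_second` and the default search string are assumptions chosen for illustration, and the regular expression only assumes the line prefix visible in the error above (severity letter, `yymmdd` date, then `hh:mm:ss.uuuuuu`).

```python
import re
from collections import Counter

# Matches the prefix of a log line such as "E241115 22:00:40.609821 ...":
# a severity letter, a yymmdd date, then hh:mm:ss with microseconds.
LOG_PREFIX = re.compile(r"^([IWEF])(\d{6}) (\d{2}:\d{2}:\d{2})\.\d{6} ")


def errors_per_second(lines, needle="snapshot intersects existing range"):
    """Count lines containing `needle`, keyed by (date, second).

    `needle` is a hypothetical default taken from the error text above;
    pass a different substring to track another message.
    """
    counts = Counter()
    for line in lines:
        m = LOG_PREFIX.match(line)
        if m and needle in line:
            counts[(m.group(2), m.group(3))] += 1
    return counts
```

Calling `errors_per_second(open(path, encoding="utf-8"))` on a log file (where `path` is whatever file holds the node's log) yields a counter whose values can be compared against the ~100/s estimate.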

Cluster link

Jira issue: CRDB-44457

Epic CRDB-42900

andrewbaptist added the C-bug label Nov 15, 2024

blathers-crl bot commented Nov 15, 2024

Hi @andrewbaptist, please add branch-* labels to identify which branch(es) this C-bug affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

andrewbaptist added the branch-master, T-kv, and A-replication-admission-control-v2 labels Nov 15, 2024
andrewbaptist added the O-perturbation label Nov 22, 2024