Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DocDB] Retryable request accepted by leader but rejected by followers #23523

Open
1 task done
Huqicheng opened this issue Aug 16, 2024 · 1 comment
Open
1 task done
Assignees
Labels
area/docdb YugabyteDB core features kind/bug This issue is a bug priority/medium Medium priority issue

Comments

@Huqicheng
Copy link
Contributor

Huqicheng commented Aug 16, 2024

Jira Link: DB-12439

Description

We found this from stress test:

Symptom

Leader has accepted the retryable request but when replicating it to followers, the same request was rejected by followers due to Duplicate request error. It can have two different symptoms:

  1. Follower1: Node 2d428f411bdb446a8aefbeb752161cea rejected the request (15.182395) because it's duplicate with a pending operation. In this case, tserver will crash because NotifyReplicationFinished called for multiple times. see [DocDB] SIGSEGV in yb-tserver 'NotifyReplicationFinished'. Multiple ReplicationFinished() callbacks triggered in retryable requests #23501
yb-tserver.ip-172-151-23-232.us-west-2.compute.internal.yugabyte.log.INFO.20240814-070504.1444.gz:W0814 07:14:46.291299  1602 raft_consensus.cc:2100] T 32ce0133ec89495d86058db37c32376d P 2d428f411bdb446a8aefbeb752161cea [term 15 FOLLOWER]: StartReplicaOperationUnlocked failed: Illegal state (yb/consensus/replica_state.cc:733): Cannot register retryable request on follower
yb-tserver.ip-172-151-23-232.us-west-2.compute.internal.yugabyte.log.INFO.20240814-070504.1444.gz:W0814 07:14:46.291330  1602 raft_consensus.cc:2117] T 32ce0133ec89495d86058db37c32376d P 2d428f411bdb446a8aefbeb752161cea [term 15 FOLLOWER]: Could not prepare operation for op: 15.182395. Suppressed 286 other warnings. Status for this op: Illegal state (yb/consensus/replica_state.cc:733): Cannot register retryable request on follower
  1. Follower2: Node 1a18c2d42cc646fabaeb29f3dcc8f593 rejected the request (15.182395) because its duplicate with a replicated operation. In this case, raft can't replicate any operations until Duplicate request 2881287 get cleaned.
0814 07:10:58.660213 35479 raft_consensus.cc:2100] T 32ce0133ec89495d86058db37c32376d P 1a18c2d42cc646fabaeb29f3dcc8f593 [term 15 FOLLOWER]: StartReplicaOperationUnlocked failed: Illegal state (yb/c
onsensus/retryable_requests.cc:384): Cannot register retryable request on follower: Duplicate request 2881287 from client ee06f9fb-6db4-43bf-8d56-7fba98052c2a (min running 2881282)
W0814 07:10:58.660241 35479 raft_consensus.cc:2117] T 32ce0133ec89495d86058db37c32376d P 1a18c2d42cc646fabaeb29f3dcc8f593 [term 15 FOLLOWER]: Could not prepare operation for op: 15.182395. Suppressed
 0 other warnings. Status for this op: Illegal state (yb/consensus/retryable_requests.cc:384): Cannot register retryable request on follower: Duplicate request 2881287 from client ee06f9fb-6db4-43bf-
8d56-7fba98052c2a (min running 2881282)

Mitigation

This issue should go away when the duplicate requests get cleaned after reaching the retryable request timeout (600secs for ysql and 60s for ycql).

RCA

In progress

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@Huqicheng Huqicheng added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Aug 16, 2024
@Huqicheng Huqicheng self-assigned this Aug 16, 2024
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Aug 16, 2024
@rthallamko3 rthallamko3 removed the status/awaiting-triage Issue awaiting triage label Aug 16, 2024
@Huqicheng
Copy link
Contributor Author

Some findings:
Assuming ts1 is the leader, ts2 is the follower.

Leader ts1 should crash at around 07:02:13.
Ts2 cleanup the duplicate request at 07:12:13, the request timeout is 10mins, so the duplicate request should be one of the last few requests accepted by the leader at around 07:02:13.

After ts1 restarted, it was elected as leader again.

yb-tserver.ip-172-151-17-9.us-west-2.compute.internal.yugabyte.log.INFO.20240814-070302.1097:I0814 07:10:53.436951  1254 tablet_bootstrap_if.cc:96] T 32ce0133ec89495d86058db37c32376d P dd3f5ec78e7b4dee9e2532fafda15a24: Bootstrap replayed 34/34 log segments. Read operations: 182392, overwritten operations: 0. Pending: 1 replicates. Last read committed op id: 14.182391, last entry metadata: { entry_time: 98831.366s offset: 21711061 active_segment_sequence_number: 34 }, last read entry op id: 14.182392

yb-tserver.ip-172-151-17-9.us-west-2.compute.internal.yugabyte.log.INFO.20240814-070302.1097:I0814 07:10:56.667130  1539 replica_state.cc:796] T 32ce0133ec89495d86058db37c32376d P dd3f5ec78e7b4dee9e2532fafda15a24 [term 15 LEADER]: Advanced the committed_op_id across terms. Last committed operation was: 14.182391 New committed index is: 15.182393

From the LMP error message, follower ts2 should have replicated 182394.

yb-tserver.ip-172-151-17-9.us-west-2.compute.internal.yugabyte.log.INFO.20240814-070302.1097:I0814 07:10:59.171384  1286 consensus_queue.cc:1309] T 32ce0133ec89495d86058db37c32376d P dd3f5ec78e7b4dee9e2532fafda15a24 [LEADER]: Got LMP mismatch error from peer: { peer: 1a18c2d42cc646fabaeb29f3dcc8f593 is_new: 0 last_received: 15.182394 next_index: 182395 last_known_committed_idx: 182394 is_last_exchange_successful: 0 needs_remote_bootstrap: 0 member_type: VOTER num_sst_files: 3 last_applied: 15.182394 last_successful_communication_time: 0ms ago}

OP 182395 has the duplicate request id 2881287 but was accepted by the leader ts1.

Some possibilities:

  1. The old request 2881287 got cleaned at leader side.
    1.1 Time-based GC, should be GCed after 10 mins.
    Impossible, because the OP should be registered at around 07:02:13 but the duplicate request was accepted at around 07:10:59, which is less than 10mins
    1.2 min running id-based GC, a request should be GCed when client reports a min running id greater than its request id.
    Impossible, because the duplicate request should be rejected due to Expired request id.
  2. The old request 2881287 was not added to RetryableRequests at leader side.
    Maybe..
  3. Client which wrote request 2881287 got evicted so that we can't find the min running id.
    Unlikely, GC algorithm will retain the client for 10 more minutes when it finds the client has 0 requests.

Huqicheng added a commit that referenced this issue Aug 20, 2024
Summary:
Retryable request accepted by leader but rejected by followers, which is unexpected.
This diff adds more debug info when this happens.
Jira: DB-12439

Test Plan: Jenkins

Reviewers: sergei, rthallam, timur

Reviewed By: timur

Subscribers: ybase, qhu

Differential Revision: https://phorge.dev.yugabyte.com/D37335
jasonyb pushed a commit that referenced this issue Aug 22, 2024
Summary:
 0fa2b24 Fix table layout to utilize maximum available space (#23564)
 336d00d [PLAT-14981]Increase default slow  query length
 e5127f8 [DOC-445] TA-22935: Potential Issues with Server-side Sequence Caching in Multi-Database Clusters (#23520)
 b3389ff [#23493] xCluster: code for ensuring there's an update for every sequence in WAL
 cb26a09 [#23548] Tools: Clean-up sys-catalog-tool code
 28025f6 [docs] Visualize migration assessment updates (#23358)
 9c0de5d [#23257] YSQL: Change conflict error string for RC transactions
 507432b [#23523] docdb: retryable requests instrumentation
 e4645e5 [DOC-431] Added a note for GKE cluster docs (#23349)
 d1576c4 [PLAT-12905] Add HA Metrics Page
 02da1f0 [PLAT-14869][PLAT-14986][PLAT-14998][PLAT-15003] - ui improvements and fixes
 c149f26 [#23556] hnsw_tool command-line tool for testing HNSW index implementations
 7725f15 Add operator mode & task info to diagnostics
 e193fc6 Revert "[#23064] YSQL: pg_partman: disable p_retention_schema parameter"
 8178372 [#23513] YSQL: Simplify several functions in ybc_pggate
 22657da [#23394] CDCSDK: Prevent tserver crash on concurrent Getchanges call on same producer tablet
 90554b0 [#23179] CDCSDK: Refactor TestPgReplicationSlot for dynamic data types

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, tfoucher

Differential Revision: https://phorge.dev.yugabyte.com/D37453
@rthallamko3 rthallamko3 assigned rthallamko3 and unassigned Huqicheng Dec 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/docdb YugabyteDB core features kind/bug This issue is a bug priority/medium Medium priority issue
Projects
None yet
Development

No branches or pull requests

3 participants