forked from yugabyte/yugabyte-db
[yugabyte#18744] DocDB: Add ability to recover from follower lag caused by stuck OutboundCall

Summary:
In production, we saw a few cases where follower lag kept increasing for a few tablets. We also noticed that ConsensusQueue did not remove the lagging replica from the quorum. After capturing a dump of the node hosting the leader, we found that `performing_update_mutex` had been held for a long time for the affected Peer.

Peer acquires `performing_update_mutex` when it builds a request to send to the peer, and keeps the mutex locked until it receives a response. In the captured dump, the OutboundCall was in SENT state, but we could not confirm whether the connection on which it was sent was still active. Our analysis of the OutboundCall references suggests that the connection had been shut down, but we were not sure.

This change tries to detect the stuck OutboundCall in Peer:
* Whenever we try to send more data or a heartbeat to the peer (in `SignalRequest`), we check whether we can acquire `performing_update_mutex`. If the mutex is already held, we check whether more than the request timeout + `FLAGS_stuck_peer_call_threshold_ms` has elapsed since the call start time.
* If the lock has been held for longer than timeout + `FLAGS_stuck_peer_call_threshold_ms`, we log additional details that can help identify the root cause of the issue (see below for examples).
* When `FLAGS_force_recover_from_stuck_peer_call` is set to true, we also try to mark the stuck call as failed. If the call object is not present, we cannot recover from this situation.

Another change concerns write failures. Whenever a Connection encounters a write failure, instead of immediately destroying the connection, the operation queues a reactor task so that all queued operations on the socket are executed in order. However, since the socket is closed, every operation still in the queue also hits a write failure, and each of them schedules a DestroyConnection task. After the first DestroyConnection task has run, the connection can no longer be found among the reactor-tracked connections, which leads to a CHECK error. To prevent multiple DestroyConnection tasks for a single connection, we track whether the connection has already queued the task for its destruction.

Information logged when we detect this situation -
```
I0822 13:36:35.338526 4161232384 rpc_stub-test.cc:1153] OutboundCall (0x0000000107186018 -> RPC call yb.rpc_test.CalculatorService.Concat -> { remote: 127.0.0.1:58098 idx: 1 protocol: 0x0000000103fae8f0 -> tcp } , state=SENT.) tostring: RPC call yb.rpc_test.CalculatorService.Concat -> { remote: 127.0.0.1:58098 idx: 1 protocol: 0x0000000103fae8f0 -> tcp } , state=SENT., times (start, sent, callback, now): (3713549.841s, 3713549.842s, 0.000s, 3713549.947s), connection: 0x000000010708a658 -> Connection (0x000000010708a658) client 127.0.0.1:58100 => 127.0.0.1:58098
I0822 13:36:35.338694 1845915648 connection.cc:409] Connection (0x000000010708a658) client 127.0.0.1:58100 => 127.0.0.1:58098: LastActivityTime: 3713549.839s, ActiveCalls stats: { during shutdown: 0, current size: 0, earliest expiry: 0.000s, }, OutboundCall: { ptr: 0x0000000107186018, call id: 2, is active: 0 }, Shutdown status: Service unavailable (yb/rpc/reactor.cc:106): Shutdown connection (system error 58), Shutdown time: 3713549.940s, Queue attempts after shutdown: { calls: 0, responses: 0 }
```
1. OutboundCall timing details - the time when the call was created, when it was sent on the network, and when its callback was invoked.
2. Connection - we dump the state of the connection on which the call was sent, including whether the connection is alive. If it is alive, we log the active calls count and whether the current call is present in the active calls. If it is not alive, we log the time when the connection was shut down, the number of active calls present during shutdown, and the queue attempts made after shutdown.

Using the above timing information and connection state, we can determine the order of events. For example, in the sample logs above, the connection was shut down (3713549.940s) after the data had already been sent on the socket (3713549.842s).

Example trace for stuck call detection in Peer -
```
I0817 17:42:07.975589 4161232384 consensus_peers.cc:179] T test-peers-tablet P peer-0 -> Peer peer-1 ([], []): Found a RPC call in stuck state - is_finished: 0, timeout: 0.005s, last_update_lock_release_time_: 3296281.832s, stuck threshold: 0.001s, force recover: 0, call state: call not set
```
Jira: DB-7637

Test Plan: ./yb_build.sh --sj ./build/latest/tests-rpc/rpc_stub-test --gtest_filter=*StuckOutboundCall*

Reviewers: mbautin, sergei

Reviewed By: mbautin

Subscribers: amitanand, yql, yyan, rthallam, ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D27859
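For illustration only, the two mechanisms described in the summary can be sketched as below. This is not the code from the change: `PeerSketch`, `ConnectionSketch`, `StuckCallOptions`, `SignalResult`, `OnResponse` and `TryQueueDestroy` are hypothetical names, an atomic flag stands in for `performing_update_mutex`, timestamps are plain millisecond counters, and "marking the call as failed" is reduced to releasing the flag.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the request timeout and the two flags.
struct StuckCallOptions {
  int64_t request_timeout_ms = 5;
  int64_t stuck_threshold_ms = 1;  // FLAGS_stuck_peer_call_threshold_ms
  bool force_recover = false;      // FLAGS_force_recover_from_stuck_peer_call
};

enum class SignalResult { kRequestSent, kCallInFlight, kStuckDetected, kStuckRecovered };

class PeerSketch {
 public:
  explicit PeerSketch(const StuckCallOptions& options) : options_(options) {}

  // Called whenever there is more data or a heartbeat to send to the peer.
  SignalResult SignalRequest(int64_t now_ms) {
    bool expected = false;
    if (update_in_progress_.compare_exchange_strong(expected, true)) {
      // "Mutex" acquired: record the call start. It stays held until OnResponse().
      call_start_ms_.store(now_ms);
      return SignalResult::kRequestSent;
    }
    // Already held: check how long the in-flight call has been outstanding.
    const int64_t held_ms = now_ms - call_start_ms_.load();
    assert(held_ms >= 0);  // Assumes a monotonic clock.
    if (held_ms <= options_.request_timeout_ms + options_.stuck_threshold_ms) {
      return SignalResult::kCallInFlight;
    }
    // The real change logs call and connection details at this point.
    if (!options_.force_recover || !call_object_present_) {
      return SignalResult::kStuckDetected;  // Detected, but not recovered.
    }
    OnResponse();  // Stand-in for marking the stuck call as failed.
    return SignalResult::kStuckRecovered;
  }

  // Response (or failure) callback: releases the "mutex".
  void OnResponse() { update_in_progress_.store(false); }

  void set_call_object_present(bool present) { call_object_present_ = present; }

 private:
  const StuckCallOptions options_;
  std::atomic<bool> update_in_progress_{false};
  std::atomic<int64_t> call_start_ms_{0};
  bool call_object_present_ = true;
};

class ConnectionSketch {
 public:
  // Returns true only for the first caller, so that at most one
  // DestroyConnection task is queued per connection.
  bool TryQueueDestroy() { return !queued_destroy_.exchange(true); }

 private:
  std::atomic<bool> queued_destroy_{false};
};
```

With the defaults in this sketch (5 ms timeout, 1 ms threshold, matching the values in the example trace), a call that has held the flag for 6 ms is still treated as in flight, while one at 7 ms is reported as stuck. A failed write path would call `TryQueueDestroy()` and schedule the destroy task only when it returns true.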
Showing 19 changed files with 307 additions and 48 deletions.