
[YSQL] Intensive workload with parallel UPDATE/DELETE all rows leads to restarting Postgres process #11349

Closed
qvad opened this issue Feb 3, 2022 · 4 comments
Assignees: ttyusupov
Labels: area/docdb YugabyteDB core features, area/ysql Yugabyte SQL (YSQL), kind/bug This issue is a bug, priority/high High Priority

Comments

@qvad (Contributor) commented Feb 3, 2022

Jira Link: DB-6047

Description

While running the sample app SqlDataLoad with additional operations (DELETE all present rows and UPDATE all present rows; see the code in the attachments), a few FATALs started to appear in the client code. The tserver and master logs do not contain any failures, so it looks like it is the Postgres process that failed.

The workload without the additional operations does not produce these exceptions.

Client log:

340697 [Thread-106] INFO com.yugabyte.sample.apps.AppBase  - Going to retrieve connection again: FATAL: Network error: recvmsg error: Connection refused
	at com.yugabyte.Driver.makeConnection(Driver.java:496)
	at com.yugabyte.Driver.connect(Driver.java:284)
340697 [Thread-137] INFO com.yugabyte.sample.apps.AppBase  - Going to retrieve connection again: FATAL: Network error: recvmsg error: Connection refused
	at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:677)
	at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:189)
...
	at com.yugabyte.sample.common.IOPSThread.run(IOPSThread.java:87)
340883 [Thread-39] INFO com.yugabyte.sample.apps.AppBase  - Going to retrieve connection again: FATAL: the database system is shutting down
com.yugabyte.util.PSQLException: FATAL: the database system is shutting down

Here is one of the Postgres log files: postgres-log.txt

Reproducer:
Example run command:

java -jar yb-sample-apps.jar --workload SqlDataLoadWithDDL  --num_writes -1 --num_threads_write 100 --num_threads_read 100 --num_reads -1 --num_unique_keys 300000000 --num_value_columns 20 --ddl_operations DROP_ROWS,UPDATE_ROWS --nodes ...

SqlDataLoadWithDDL.java.zip

@qvad qvad added the area/ysql Yugabyte SQL (YSQL) label Feb 3, 2022
@bmatican bmatican assigned ttyusupov and unassigned bmatican Feb 3, 2022
@bmatican bmatican added the area/docdb YugabyteDB core features label Feb 3, 2022
@bmatican (Contributor) commented Feb 3, 2022

It might be that the client doesn't retry certain types of errors (at all, or not enough times). Will talk to Timur next week.

@qvad (Contributor, Author) commented Feb 8, 2022

Full logs from a recent quick run:
archive-2.zip

@qvad qvad changed the title [YSQL] Intensive tablet split workload with parallel UPDATE/DELETE all rows leads to restarting Postgres process [YSQL] Intensive workload with parallel UPDATE/DELETE all rows leads to restarting Postgres process Feb 8, 2022
@bmatican bmatican added the priority/high High Priority label Feb 16, 2022
ttyusupov added a commit that referenced this issue Mar 21, 2022
…nd ConsensusRound callbacks invocation

Summary:
**Background**

Since D5660/023c20ad6caf13e5228081ddec07d9f3cf904348 we track the Write RPC request ID to prevent duplicate writes caused by retries of Write RPCs.
The request ID is assigned by `YBClient` based on `TabletRequests::request_id_seq`, and the TServer tracks running request IDs in the `RetryableRequests` structure.

When we split a tablet, each of its child tablets gets a copy of its Raft log and also of the `RetryableRequests` structure.
On the `YBClient` side, child tablets get the `TabletRequests::request_id_seq` value from the parent tablet, to stay consistent with the TServer side.

But if the process hosting the `YBClient` has been restarted or partitioned away, or simply was not issuing requests to specific tablets during a sequence of splits, it might have no information about a particular tablet: it doesn't know `TabletRequests::request_id_seq` for the parent tablet, so it can't start from it for the child tablets. This case was addressed by D9264/f9cae11ced426476d4f70560b2afc29f921a1737. If the `YBClient` (`MetaCache`) doesn't know about the parent tablet, it uses a special value as `request_id_seq` for the child tablet. Then, on getting a "request id is less than min" error, it sets `request_id_seq` to the min running request ID from the server plus 2^24 (there won't be 2^24 client requests in progress from the same client to the same tablet, so this is safe).
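
To make the client-side fallback concrete, here is a minimal C++ sketch of the logic just described; `TabletRequests::request_id_seq` is real, but `kRequestIdOffset`, `ResolveRequestIdSeq`, and the use of `std::optional` as the "special value" are hypothetical names and assumptions, not the actual `MetaCache` code:

```cpp
#include <cstdint>
#include <optional>

constexpr uint64_t kRequestIdOffset = uint64_t{1} << 24;  // 2^24

struct TabletRequests {
  // std::nullopt stands in for the "special value" used when the
  // parent tablet (and thus its sequence) is unknown to the client.
  std::optional<uint64_t> request_id_seq;
};

// Called after the server rejects a write with
// "request id is less than min running".
void ResolveRequestIdSeq(TabletRequests& requests,
                         uint64_t server_min_running_request_id) {
  // There won't be 2^24 requests in flight from one client to one
  // tablet, so jumping past the server's minimum is safe.
  requests.request_id_seq = server_min_running_request_id + kRequestIdOffset;
}
```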

**Problem**

On the TServer side, the `MaintenanceManager` periodically triggers cleanup of the `RetryableRequests` structure. Due to overload, or to having too many tablets, it can happen that the `RetryableRequests` structure has already been cleaned up at the tablet leader but not yet at a follower. This can lead to the following failure scenario:
1) Tablet `T` is split into `T1` and `T2`.
2) `YBClient` A starts doing writes to tablet `T1`. Since A doesn't know about tablet `T`, it sets `request_id_seq` to `min_running_request_id` for A from the `T` leader (0) plus `1 << 24`. After 100 writes it becomes `100 + (1 << 24)`.
3) Tablet `T1` is split into `T1.1` and `T1.2`.
4) Tablet `T1.1` is split into `T1.1.1` and `T1.1.2`.
5) `RetryableRequests` cleanup is done at the leader of `T1.1`, but not at the followers.
6) `YBClient` A starts a write to tablet `T1.1.2`. Since A doesn't know about tablet `T1.1`, it sets `request_id_seq` to `min_running_request_id` for A from the `T1.1` leader (0 due to cleanup) plus `1 << 24`, that is `1 << 24`.
7) The `T1.1` leader accepts the write with `request_id = 1 << 24` and tries to replicate it at the followers.
8) At a follower, `min_running_request_id` is `100 + (1 << 24)` (inherited from the parent tablets), so the follower rejects a write request that the leader has already accepted. The error message is: "Request id 16777216 from client ... is less than min running 16777316" (see the arithmetic check below).
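
The constants in that error follow directly from the scenario; this snippet just reproduces the arithmetic of steps 6-8 (illustrative, not YugabyteDB code):

```cpp
#include <cstdint>
#include <iostream>

int main() {
  const uint64_t offset = uint64_t{1} << 24;   // 16777216
  const uint64_t follower_min = 100 + offset;  // 16777316, inherited from the parent
  const uint64_t request_id = offset;          // fresh ID chosen in step 6
  // The follower's duplicate-detection check fires for a write the
  // leader has already accepted:
  if (request_id < follower_min) {
    std::cout << "rejected: request id " << request_id
              << " is less than min running " << follower_min << "\n";
  }
  return 0;
}
```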

**Solution**

The solution is to update `ReplicaState::AddPendingOperation` so that, when registering a retryable request ID at the follower fails, it runs the cleanup there and then tries to register again.
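
A rough sketch of that cleanup-and-retry idea, with simplified stand-ins for `Status` and `RetryableRequests` (the real change lives in `ReplicaState::AddPendingOperation`; the types and the crude cleanup below are illustrative assumptions, not the actual implementation):

```cpp
#include <cstdint>
#include <set>
#include <string>

// Minimal stand-ins for the real types; illustrative only.
struct Status {
  std::string message;
  bool ok() const { return message.empty(); }
};

class RetryableRequests {
 public:
  // Per the fix, Register() returns an error instead of invoking the
  // operation's callback itself; the caller decides how to react.
  Status Register(uint64_t request_id) {
    if (request_id < min_running_request_id_) {
      return {"request id is less than min running"};
    }
    running_.insert(request_id);
    return {};
  }
  // Crude stand-in for the time-based cleanup: dropping expired entries
  // lowers the minimum running request ID tracked for this client.
  void CleanupExpired() { min_running_request_id_ = 0; }

 private:
  uint64_t min_running_request_id_ = 100 + (uint64_t{1} << 24);
  std::set<uint64_t> running_;
};

// Sketch of the change: on a registration error at the follower,
// run the cleanup the leader already performed, then try once more.
Status RegisterWithCleanupRetry(RetryableRequests& requests, uint64_t request_id) {
  Status s = requests.Register(request_id);
  if (s.ok()) return s;
  requests.CleanupExpired();
  return requests.Register(request_id);
}

int main() {
  RetryableRequests follower;
  // A fresh ID of 1 << 24 would be rejected before the fix; with the
  // cleanup-and-retry it registers successfully.
  return RegisterWithCleanupRetry(follower, uint64_t{1} << 24).ok() ? 0 : 1;
}
```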

Also, some other fixes/improvements have been made:
- Updated `RetryableRequests::Register` to return an error instead of invoking the callback on error by itself; callers are responsible for invoking the callback if needed.
- Updated `RaftConsensus::StartConsensusOnlyRoundUnlocked` to invoke the callback in case of error.
- Updated `YBClient` logging to prefix messages with both the client ID and the `MetaCache` address.
- Added a sanity check to avoid duplicate `ConsensusRoundCallback` calls and to verify that the callback is always called before destruction.
- Updated `RaftConsensus::AppendNewRoundsToQueueUnlocked` to reduce code duplication and made it responsible for invoking callbacks for processed rounds.
- Refactored `OperationDriver::Init` to exit early in case of error.
- Enhanced `CatalogManager::RegisterNewTabletForSplit` logging.
- Fixed `OperationTracker::Add` null pointer handling.
- Added a test.

Test Plan: TabletSplitITest.SplitClientRequestsClean

Reviewers: sergei

Reviewed By: sergei

Subscribers: bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D15484
ttyusupov added a commit that referenced this issue Jun 29, 2022
…g at follower and ConsensusRound callbacks invocation

Summary: Backport of D15484; the summary is identical to the first commit above.

Original commit: d26017d / D15484

Test Plan: TabletSplitITest.SplitClientRequestsClean

Reviewers: jmeehan

Reviewed By: jmeehan

Subscribers: jmeehan, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D17854
ttyusupov added a commit that referenced this issue Jul 1, 2022
…ing at follower and ConsensusRound callbacks invocation

Summary: Backport of D15484; the summary is identical to the first commit above.

Original commit: d26017d / D15484

Test Plan: TabletSplitITest.SplitClientRequestsClean

Reviewers: amitanand

Reviewed By: amitanand

Subscribers: bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D18036
ttyusupov added a commit that referenced this issue Jul 6, 2022
… at follower and ConsensusRound callbacks invocation

Summary: Backport of D15484; the summary is identical to the first commit above.

Original commit: d26017d / D15484

Test Plan: TabletSplitITest.SplitClientRequestsClean

Reviewers: amitanand

Reviewed By: amitanand

Subscribers: ybase, bogdan, jmeehan

Differential Revision: https://phabricator.dev.yugabyte.com/D18127
ttyusupov added a commit that referenced this issue Jul 6, 2022
…ing at follower and ConsensusRound callbacks invocation

Summary: Backport of D15484; the summary is identical to the first commit above.

Original commit: d26017d / D15484

Test Plan: TabletSplitITest.SplitClientRequestsClean

Reviewers: mbautin, amitanand

Reviewed By: amitanand

Subscribers: bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D18116
ttyusupov added a commit that referenced this issue Jul 7, 2022
…ing at follower and ConsensusRound callbacks invocation

Summary: Backport of D15484; the summary is identical to the first commit above.

Original commit: d26017d / D15484

Test Plan: TabletSplitITest.SplitClientRequestsClean

Reviewers: amitanand

Reviewed By: amitanand

Subscribers: bogdan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D18150
@ttyusupov (Contributor) commented Jul 11, 2022

The fix is described in the commit summary above (d26017d / D15484). One additional detail: the `RetryableRequests` cleanup triggered by the `MaintenanceManager` removes the IDs of expired requests, i.e. requests older than `retryable_request_timeout_secs`.

@ttyusupov (Contributor) commented:

Even without dynamic tablet splitting enabled, it can happen that the YBClient inside the YQL layer retries an RPC to the tserver within `client_read_write_timeout_ms` (or `ysql_client_read_write_timeout_ms` for YSQL). If the retry of the original tserver RPC happens after `retryable_request_timeout_secs` (120 seconds by default), we get the same issue: the duplicate WRITE_OP fails to apply at the follower, even though the leader did not treat it as a duplicate and applied it successfully.
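
As a back-of-the-envelope check of that timing window (the flag defaults are as stated above; the 10-minute client timeout and the 130-second retry delay are hypothetical values chosen to make the window visible):

```cpp
#include <chrono>
#include <iostream>

int main() {
  using namespace std::chrono;
  const seconds retryable_request_timeout{120};          // retryable_request_timeout_secs default
  const milliseconds client_rw_timeout{10 * 60 * 1000};  // hypothetical client_read_write_timeout_ms
  const seconds retry_delay{130};                        // a retry 130 s after the original RPC

  const bool client_still_retrying = retry_delay < client_rw_timeout;
  const bool server_forgot_request = retry_delay > retryable_request_timeout;
  // If both hold, the leader re-applies the retried WRITE_OP as new,
  // while a follower that still remembers the request ID rejects it.
  std::cout << (client_still_retrying && server_forgot_request
                    ? "window open: duplicate can fail at the follower\n"
                    : "safe\n");
  return 0;
}
```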
