
[YSQL] Intensive workload with parallel UPDATE/DELETE all rows leads to restarting Postgres process #11349

Closed
qvad opened this issue Feb 3, 2022 · 4 comments
Assignees: ttyusupov
Labels: area/docdb YugabyteDB core features, area/ysql Yugabyte SQL (YSQL), kind/bug This issue is a bug, priority/high High Priority

Comments

@qvad (Contributor) commented Feb 3, 2022

Jira Link: DB-6047

Description

While running the sample app SqlDataLoad with additional operations (DELETE all present rows and UPDATE all present rows; see the code in the attachments), a few FATALs started to appear in the client code. The tserver and master logs do not contain any failures, so it looks like it is the Postgres process that failed.

The workload without the additional operations does not produce these exceptions.

Client log:

340697 [Thread-106] INFO com.yugabyte.sample.apps.AppBase  - Going to retrieve connection again: FATAL: Network error: recvmsg error: Connection refused
	at com.yugabyte.Driver.makeConnection(Driver.java:496)
	at com.yugabyte.Driver.connect(Driver.java:284)
340697 [Thread-137] INFO com.yugabyte.sample.apps.AppBase  - Going to retrieve connection again: FATAL: Network error: recvmsg error: Connection refused
	at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:677)
	at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:189)
...
	at com.yugabyte.sample.common.IOPSThread.run(IOPSThread.java:87)
340883 [Thread-39] INFO com.yugabyte.sample.apps.AppBase  - Going to retrieve connection again: FATAL: the database system is shutting down
com.yugabyte.util.PSQLException: FATAL: the database system is shutting down

Here is one of the Postgres log files: postgres-log.txt

Reproducer:
Example run command:

java -jar yb-sample-apps.jar --workload SqlDataLoadWithDDL  --num_writes -1 --num_threads_write 100 --num_threads_read 100 --num_reads -1 --num_unique_keys 300000000 --num_value_columns 20 --ddl_operations DROP_ROWS,UPDATE_ROWS --nodes ...

SqlDataLoadWithDDL.java.zip

@qvad qvad added the area/ysql Yugabyte SQL (YSQL) label Feb 3, 2022
@bmatican bmatican assigned ttyusupov and unassigned bmatican Feb 3, 2022
@bmatican bmatican added the area/docdb YugabyteDB core features label Feb 3, 2022
@bmatican (Contributor) commented Feb 3, 2022

It might be that the client doesn't retry certain types of errors (at all, or not enough times). Will talk to Timur next week.

@qvad (Contributor, Author) commented Feb 8, 2022

Full logs from a recent quick run:
archive-2.zip

@qvad qvad changed the title [YSQL] Intensive tablet split workload with parallel UPDATE/DELETE all rows leads to restarting Postgres process [YSQL] Intensive workload with parallel UPDATE/DELETE all rows leads to restarting Postgres process Feb 8, 2022
@bmatican bmatican added the priority/high High Priority label Feb 16, 2022
ttyusupov added a commit that referenced this issue Mar 21, 2022
…nd ConsensusRound callbacks invocation

Summary:
**Background**

Since D5660/023c20ad6caf13e5228081ddec07d9f3cf904348 we track the Write RPC request ID to prevent duplicate writes caused by retries of Write RPCs.
The request ID is assigned by `YBClient` based on `TabletRequests::request_id_seq`, and the TServer tracks running request IDs in the `RetryableRequests` structure.

When we split a tablet, each of its child tablets gets a copy of its Raft log and also of the `RetryableRequests` structure.
On the `YBClient` side, child tablets get the `TabletRequests::request_id_seq` value from the parent tablet, to stay consistent with the TServer side.

But if the process hosting the `YBClient` has been restarted or partitioned away, or simply was not issuing requests to specific tablets during a sequence of splits, it might have no information about a particular tablet: it doesn't know `TabletRequests::request_id_seq` for the parent tablet, so it can't start from it for the child tablets. This case was addressed by D9264/f9cae11ced426476d4f70560b2afc29f921a1737. If the `YBClient` (`MetaCache`) doesn't know about the parent tablet, it uses a special value as `request_id_seq` for the child tablet. Then, on getting a "request id is less than min" error, it sets `request_id_seq` to the min running request ID from the server plus 2^24 (there won't be 2^24 client requests in progress from the same client to the same tablet, so this is safe).
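
To make the client-side fallback concrete, here is a minimal C++ sketch of the logic just described; `TabletRequests::request_id_seq` is real, but `kRequestIdOffset`, `ResolveRequestIdSeq`, and the use of `std::optional` as the "special value" are hypothetical names and assumptions, not the actual `MetaCache` code:

```cpp
#include <cstdint>
#include <optional>

constexpr uint64_t kRequestIdOffset = uint64_t{1} << 24;  // 2^24

struct TabletRequests {
  // std::nullopt stands in for the "special value" used when the
  // parent tablet (and thus its sequence) is unknown to the client.
  std::optional<uint64_t> request_id_seq;
};

// Called after the server rejects a write with
// "request id is less than min running".
void ResolveRequestIdSeq(TabletRequests& requests,
                         uint64_t server_min_running_request_id) {
  // There won't be 2^24 requests in flight from one client to one
  // tablet, so jumping past the server's minimum is safe.
  requests.request_id_seq = server_min_running_request_id + kRequestIdOffset;
}
```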

**Problem**

On the TServer side, the `MaintenanceManager` periodically triggers cleanup of the `RetryableRequests` structure. Due to overload, or to having too many tablets, it can happen that the `RetryableRequests` structure has already been cleaned up at the tablet leader but not yet at a follower. This can lead to the following failure scenario:
1) Tablet `T` is split into `T1` and `T2`.
2) `YBClient` A starts doing writes to tablet `T1`. Since A doesn't know about tablet `T`, it sets `request_id_seq` to `min_running_request_id` for A from the `T` leader (0) plus `1 << 24`. After 100 writes it becomes `100 + (1 << 24)`.
3) Tablet `T1` is split into `T1.1` and `T1.2`.
4) Tablet `T1.1` is split into `T1.1.1` and `T1.1.2`.
5) `RetryableRequests` cleanup is done at the leader of `T1.1`, but not at the followers.
6) `YBClient` A starts a write to tablet `T1.1.2`. Since A doesn't know about tablet `T1.1`, it sets `request_id_seq` to `min_running_request_id` for A from the `T1.1` leader (0 due to cleanup) plus `1 << 24`, that is `1 << 24`.
7) The `T1.1` leader accepts the write with `request_id = 1 << 24` and tries to replicate it at the followers.
8) At a follower, `min_running_request_id` is `100 + (1 << 24)` (inherited from the parent tablets), so the follower rejects a write request that the leader has already accepted. The error message is: "Request id 16777216 from client ... is less than min running 16777316" (see the arithmetic check below).
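
The constants in that error follow directly from the scenario; this snippet just reproduces the arithmetic of steps 6-8 (illustrative, not YugabyteDB code):

```cpp
#include <cstdint>
#include <iostream>

int main() {
  const uint64_t offset = uint64_t{1} << 24;   // 16777216
  const uint64_t follower_min = 100 + offset;  // 16777316, inherited from the parent
  const uint64_t request_id = offset;          // fresh ID chosen in step 6
  // The follower's duplicate-detection check fires for a write the
  // leader has already accepted:
  if (request_id < follower_min) {
    std::cout << "rejected: request id " << request_id
              << " is less than min running " << follower_min << "\n";
  }
  return 0;
}
```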

**Solution**

The solution is to update `ReplicaState::AddPendingOperation` so that, when registering a retryable request ID at the follower fails, it runs the cleanup there and then tries to register again.
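
A rough sketch of that cleanup-and-retry idea, with simplified stand-ins for `Status` and `RetryableRequests` (the real change lives in `ReplicaState::AddPendingOperation`; the types and the crude cleanup below are illustrative assumptions, not the actual implementation):

```cpp
#include <cstdint>
#include <set>
#include <string>

// Minimal stand-ins for the real types; illustrative only.
struct Status {
  std::string message;
  bool ok() const { return message.empty(); }
};

class RetryableRequests {
 public:
  // Per the fix, Register() returns an error instead of invoking the
  // operation's callback itself; the caller decides how to react.
  Status Register(uint64_t request_id) {
    if (request_id < min_running_request_id_) {
      return {"request id is less than min running"};
    }
    running_.insert(request_id);
    return {};
  }
  // Crude stand-in for the time-based cleanup: dropping expired entries
  // lowers the minimum running request ID tracked for this client.
  void CleanupExpired() { min_running_request_id_ = 0; }

 private:
  uint64_t min_running_request_id_ = 100 + (uint64_t{1} << 24);
  std::set<uint64_t> running_;
};

// Sketch of the change: on a registration error at the follower,
// run the cleanup the leader already performed, then try once more.
Status RegisterWithCleanupRetry(RetryableRequests& requests, uint64_t request_id) {
  Status s = requests.Register(request_id);
  if (s.ok()) return s;
  requests.CleanupExpired();
  return requests.Register(request_id);
}

int main() {
  RetryableRequests follower;
  // A fresh ID of 1 << 24 would be rejected before the fix; with the
  // cleanup-and-retry it registers successfully.
  return RegisterWithCleanupRetry(follower, uint64_t{1} << 24).ok() ? 0 : 1;
}
```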

Also, some other fixes/improvements have been made:
- Updated `RetryableRequests::Register` to return an error instead of invoking the callback on error by itself; callers are responsible for invoking the callback if needed.
- Updated `RaftConsensus::StartConsensusOnlyRoundUnlocked` to invoke the callback in case of error.
- Updated `YBClient` logging to prefix messages with both the client ID and the `MetaCache` address.
- Added a sanity check to avoid duplicate `ConsensusRoundCallback` calls and to verify that the callback is always called before destruction.
- Updated `RaftConsensus::AppendNewRoundsToQueueUnlocked` to reduce code duplication and made it responsible for invoking callbacks for processed rounds.
- Refactored `OperationDriver::Init` to exit early in case of error.
- Enhanced `CatalogManager::RegisterNewTabletForSplit` logging.
- Fixed `OperationTracker::Add` null pointer handling.
- Added a test.

Test Plan: TabletSplitITest.SplitClientRequestsClean

Reviewers: sergei

Reviewed By: sergei

Subscribers: bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D15484
ttyusupov added a commit that referenced this issue Jun 29, 2022
…g at follower and ConsensusRound callbacks invocation

Summary: Backport of D15484; the summary is identical to the first commit above.

Original commit: d26017d / D15484

Test Plan: TabletSplitITest.SplitClientRequestsClean

Reviewers: jmeehan

Reviewed By: jmeehan

Subscribers: jmeehan, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D17854
ttyusupov added a commit that referenced this issue Jul 1, 2022
…ing at follower and ConsensusRound callbacks invocation

Summary: Backport of D15484; the summary is identical to the first commit above.

Original commit: d26017d / D15484

Test Plan: TabletSplitITest.SplitClientRequestsClean

Reviewers: amitanand

Reviewed By: amitanand

Subscribers: bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D18036
ttyusupov added a commit that referenced this issue Jul 6, 2022
… at follower and ConsensusRound callbacks invocation

Summary: Backport of D15484; the summary is identical to the first commit above.

Original commit: d26017d / D15484

Test Plan: TabletSplitITest.SplitClientRequestsClean

Reviewers: amitanand

Reviewed By: amitanand

Subscribers: ybase, bogdan, jmeehan

Differential Revision: https://phabricator.dev.yugabyte.com/D18127
ttyusupov added a commit that referenced this issue Jul 6, 2022
…ing at follower and ConsensusRound callbacks invocation

Summary: Backport of D15484; the summary is identical to the first commit above.

Original commit: d26017d / D15484

Test Plan: TabletSplitITest.SplitClientRequestsClean

Reviewers: mbautin, amitanand

Reviewed By: amitanand

Subscribers: bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D18116
ttyusupov added a commit that referenced this issue Jul 7, 2022
…ing at follower and ConsensusRound callbacks invocation

Summary: Backport of D15484; the summary is identical to the first commit above.

Original commit: d26017d / D15484

Test Plan: TabletSplitITest.SplitClientRequestsClean

Reviewers: amitanand

Reviewed By: amitanand

Subscribers: bogdan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D18150
@ttyusupov (Contributor) commented Jul 11, 2022

The fix is described in the commit summary above (d26017d / D15484). One additional detail: the `RetryableRequests` cleanup triggered by the `MaintenanceManager` removes the IDs of expired requests, i.e. requests older than `retryable_request_timeout_secs`.

@ttyusupov (Contributor) commented:

Even without dynamic tablet splitting enabled, it can happen that the YBClient inside the YQL layer retries an RPC to the tserver within `client_read_write_timeout_ms` (or `ysql_client_read_write_timeout_ms` for YSQL). If the retry of the original tserver RPC happens after `retryable_request_timeout_secs` (120 seconds by default), we get the same issue: the duplicate WRITE_OP fails to apply at the follower, even though the leader did not treat it as a duplicate and applied it successfully.
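
As a back-of-the-envelope check of that timing window (the flag defaults are as stated above; the 10-minute client timeout and the 130-second retry delay are hypothetical values chosen to make the window visible):

```cpp
#include <chrono>
#include <iostream>

int main() {
  using namespace std::chrono;
  const seconds retryable_request_timeout{120};          // retryable_request_timeout_secs default
  const milliseconds client_rw_timeout{10 * 60 * 1000};  // hypothetical client_read_write_timeout_ms
  const seconds retry_delay{130};                        // a retry 130 s after the original RPC

  const bool client_still_retrying = retry_delay < client_rw_timeout;
  const bool server_forgot_request = retry_delay > retryable_request_timeout;
  // If both hold, the leader re-applies the retried WRITE_OP as new,
  // while a follower that still remembers the request ID rejects it.
  std::cout << (client_still_retrying && server_forgot_request
                    ? "window open: duplicate can fail at the follower\n"
                    : "safe\n");
  return 0;
}
```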
