[DocDB] Tablet Bootstrap Hangs Due to Truncate #23243

Closed
1 task done
yusong-yan opened this issue Jul 18, 2024 · 1 comment

yusong-yan commented Jul 18, 2024

Jira Link: DB-12175

Description

We observed tablet bootstrap getting stuck while replaying a truncate operation.

    @     0x7f926b3be3b7  __pthread_cond_timedwait
    @     0x562ec520cd81  std::__1::condition_variable::wait_until<>()
    @     0x562ec53dc557  std::__1::this_thread::sleep_until<>()
    @     0x562ec687791a  yb::RWOperationCounter::DisableAndWaitForOps()
    @     0x562ec6878eef  yb::ScopedRWOperationPause::ScopedRWOperationPause()
    @     0x562ec6200579  yb::tablet::Tablet::PauseReadWriteOperations()
    @     0x562ec61ff98b  yb::tablet::Tablet::StartShutdownRocksDBs()
    @     0x562ec6226adc  yb::tablet::Tablet::Truncate()
    @     0x562ec624c9f7  yb::tablet::TabletBootstrap::PlayAnyRequest()
    @     0x562ec624a93d  yb::tablet::TabletBootstrap::ApplyCommittedPendingReplicates()
    @     0x562ec62448e5  yb::tablet::TabletBootstrap::PlaySegments()
    @     0x562ec6238087  yb::tablet::TabletBootstrap::Bootstrap()
    @     0x562ec62500ac  yb::tablet::BootstrapTablet()
    @     0x562ec64fa5b6  yb::tserver::TSTabletManager::OpenTablet()
    @     0x562ec68b3598  yb::ThreadPool::DispatchThread()
    @     0x562ec68af753  yb::thread::SuperviseThread()

Most likely, it is waiting for TransactionLoader::Executor to release its ScopedRWOperation, which only happens after bootstrap completes.
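To make the blocking behaviour in the stack trace concrete, here is a minimal sketch of a counter that must drain before a shutdown can start. It is not YugabyteDB's `RWOperationCounter`: the class `ToyOpCounter`, its methods, and the condition-variable wait are assumptions made for illustration only.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Toy stand-in for an operation counter that is drained before shutdown.
// Hypothetical; it only mimics the "disable, then wait for ops" idea.
class ToyOpCounter {
 public:
  // Registers an in-flight operation; fails once the counter is disabled.
  bool Acquire() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (disabled_) return false;
    ++pending_;
    return true;
  }

  // Marks one in-flight operation as finished and wakes any waiter.
  void Release() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      --pending_;
    }
    cv_.notify_all();
  }

  // Blocks new operations, then waits for in-flight ones to drain.
  // Returns false if operations are still pending when the timeout expires.
  bool DisableAndWaitForOps(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mutex_);
    disabled_ = true;
    return cv_.wait_for(lock, timeout, [this] { return pending_ == 0; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  int pending_ = 0;
  bool disabled_ = false;
};

// True if a shutdown attempted while one operation is held cannot proceed.
bool ShutdownBlocksWhileOpHeld() {
  ToyOpCounter counter;
  bool acquired = counter.Acquire();  // plays the transaction loader's hold
  bool drained = counter.DisableAndWaitForOps(std::chrono::milliseconds(50));
  counter.Release();
  return acquired && !drained;
}

// True if a shutdown with no operations in flight proceeds immediately.
bool ShutdownSucceedsWhenIdle() {
  ToyOpCounter counter;
  return counter.DisableAndWaitForOps(std::chrono::milliseconds(50));
}
```

In this model, `ShutdownBlocksWhileOpHeld()` corresponds to the hang above: the truncate path waits for the counter to drain while another component still holds an operation.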

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@yusong-yan yusong-yan added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Jul 18, 2024
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Jul 18, 2024

yusong-yan commented Jul 18, 2024

Here is where the executor is destroyed.

void LoadFinished(Status load_status) EXCLUDES(status_resolvers_mutex_) override {
 ...
 start_latch_.Wait();

start_latch_.Wait() returns only after tablet bootstrap finishes.

@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Jul 30, 2024
@yugabyte-ci yugabyte-ci added priority/high High Priority and removed priority/medium Medium priority issue labels Aug 1, 2024
yusong-yan added a commit that referenced this issue Aug 17, 2024
…eration

Summary:
**Issue:**
Tablet bootstrap can run into a deadlock if it needs to replay a truncate operation. Here is the sequence of events leading to the deadlock:

| Thread 1 (Main bootstrap thread) | Thread 2 (Load Transactions) |
| --- | --- |
| 1. Begin tablet bootstrap. | |
| 2. During OpenTablet, if transactions are enabled (a YSQL table, or a YCQL table with transactions enabled), acquire the `start_latch` by setting its value to 1; another thread (Thread 2) is created for the transaction load. | |
| | 3. Execute the transaction load; acquire `pending_op_counter_blocking_rocksdb_shutdown_start_` to prevent RocksDB shutdown. |
| 4. Replay the tablet truncate operation; wait for `pending_op_counter_blocking_rocksdb_shutdown_start_` to be released in order to shut down RocksDB. | |
| | 5. Transaction load completes. |
| | 6. Call `LoadFinished`, which starts waiting for `start_latch` to reach 0. |
| 7. Bootstrap completes; release the `start_latch` by setting its value to 0. | |
| | 8. Release `pending_op_counter_blocking_rocksdb_shutdown_start_`. |

Thread 1 is stuck at step 4, waiting for step 8 to be executed.
Thread 2 is stuck at step 6, waiting for step 7 to be executed.

Result: Tablet bootstrap gets stuck replaying the truncate operation. This issue was introduced by D29000 (commit id: 5159eb3), which changed the code to destroy the executor instance (which holds the operation counter) only after FinishLoad.
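The circular wait between steps 4, 6, 7 and 8 can be reproduced in a small model. The sketch below is illustrative only: `Event`, `StuckThreads` and `SimulateBuggyOrdering` are hypothetical names, and timeouts replace the unbounded waits of the real code so that the simulation terminates and can report which thread was stuck.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// One-shot event that a thread can wait on with a timeout.
class Event {
 public:
  void Set() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      set_ = true;
    }
    cv_.notify_all();
  }

  // Returns true if the event was set before the timeout expired.
  bool WaitFor(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mutex_);
    return cv_.wait_for(lock, timeout, [this] { return set_; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool set_ = false;
};

struct StuckThreads {
  bool bootstrap_stuck_at_step4 = false;
  bool loader_stuck_at_step6 = false;
};

// Models the buggy ordering: each thread proceeds only if the other one
// makes progress first, so neither event is ever set.
StuckThreads SimulateBuggyOrdering() {
  using std::chrono::milliseconds;
  Event op_acquired;     // step 3: loader holds the op counter
  Event op_released;     // step 8: loader released the op counter
  Event latch_released;  // step 7: bootstrap released start_latch
  StuckThreads result;

  std::thread loader([&] {
    op_acquired.Set();  // step 3
    // Steps 5-6: load is done, LoadFinished waits for the start latch.
    if (latch_released.WaitFor(milliseconds(100))) {
      op_released.Set();  // step 8
    } else {
      result.loader_stuck_at_step6 = true;
    }
  });

  op_acquired.WaitFor(milliseconds(5000));
  // Step 4: replaying truncate waits for the op counter to be released.
  if (op_released.WaitFor(milliseconds(100))) {
    latch_released.Set();  // step 7
  } else {
    result.bootstrap_stuck_at_step4 = true;
  }
  loader.join();
  return result;
}
```

Calling `SimulateBuggyOrdering()` reports both threads as stuck, because step 7 is only reachable after step 4 and step 8 is only reachable after step 6.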

**Fix:**
Reset `pending_op_counter_blocking_rocksdb_shutdown_start_` before calling `loader_.FinishLoad(status)`. This is not a clean fix, but it is safe: FinishLoad doesn't need protection from the op counter, because it acquires its own op counter when processing the pending applies.
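The effect of releasing the counter before the blocking call can be sketched with a small RAII handle. This is a simplified illustration rather than the actual patch: `PendingOps`, `ScopedOp` and `PendingOpsSeenByFinishLoad` are hypothetical names, and the point where the real code calls `FinishLoad` is represented only by reading the counter.

```cpp
#include <cassert>

// Counts operations that block a (hypothetical) storage shutdown.
class PendingOps {
 public:
  void Increment() { ++count_; }
  void Decrement() { --count_; }
  int count() const { return count_; }

 private:
  int count_ = 0;
};

// RAII handle: holds one pending operation until Reset() or destruction.
class ScopedOp {
 public:
  explicit ScopedOp(PendingOps* ops) : ops_(ops) { ops_->Increment(); }
  ~ScopedOp() { Reset(); }
  ScopedOp(const ScopedOp&) = delete;
  ScopedOp& operator=(const ScopedOp&) = delete;

  // Releases the operation early; safe to call more than once.
  void Reset() {
    if (ops_ != nullptr) {
      ops_->Decrement();
      ops_ = nullptr;
    }
  }

 private:
  PendingOps* ops_;
};

// Returns how many operations are still pending at the point where the
// loader would start waiting for bootstrap to finish.
int PendingOpsSeenByFinishLoad(bool reset_before_finish) {
  PendingOps ops;
  int seen = -1;
  {
    ScopedOp scoped_op(&ops);
    // ... transaction load would run here, protected by scoped_op ...
    if (reset_before_finish) {
      scoped_op.Reset();  // the fix: release before the blocking wait
    }
    seen = ops.count();  // stands in for the blocking FinishLoad call
  }
  return seen;
}
```

With `reset_before_finish` set, the counter is already zero when the wait begins, so a concurrent truncate replay would no longer be blocked by the loader; without it, the operation stays pending until the handle is destroyed.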

**Affected Version 2.20.1**
Starting from D29000 (commit id: 5159eb3), tablet bootstrap will get stuck when replaying a truncate operation.
Jira: DB-12175

Test Plan: ./yb_build.sh --cxx-test pgwrapper_pg_single_tserver-test --gtest_filter PgSingleTServerTest.BootstrapReplayTruncate

Reviewers: bkolagani, timur, sergei, mbautin, rthallam

Reviewed By: sergei, mbautin, rthallam

Subscribers: mbautin, yql, slingam, rthallam, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D37152
jasonyb pushed a commit that referenced this issue Aug 20, 2024
Summary:
 b8cd4da Fix broken header links in Explore section (#23522)
 cc43b2e [DEVOPS-3048] test automation: Backup unit tests use YBC by default
 43652cc [PLAT-13836] Upgrading python setuptools
 71f5eeb [doc][yba] Note on deleting KMS config (#23527)
 a7061c6 [PLAT-14912] Adding replicated migrate guardrails for subdirectories
 9b21783 [#23286] xCluster: Speedup setup for large table counts
 6d4d8f6 [DEVOPS-3048] test automation: Fix ybc extraction
 15fd362 [PLAT-14951] Add positive interger error message to pitr param step
 c8cbcbf [#23243] docdb: Fix tablet bootstrap stuck when replaying truncate operation
 5f286f5 [PLAT-14976] Make node agent silent parameters more obvious by showing in usage
 68ac66e [#23492] DocDB: Upgrade and Rollback tests
 Excluded: 516ead0 [#23304] xCluster: fix ysql_dump/Postgres so pg_class OIDs are preserved
 Excluded: 16941de [#23304] fix Postgres so old dumps can be loaded that do not have pg_class OIDs
 404075d [#23376] DocDB: Utilities needed for HNSW
 875ccc1 [PLAT-14077] Update /get endpoint to support db scoped replication tables + metrics
 71610b5 [#23536] fix test_macros.h to avoid problems with complaints about capturing variables
 027f0e1 [#23493] xCluster: implement function for scanning sequences_data table
 9103885 [#22462] DocDB: Enable pg_cron tests in TSAN
 7e1f72c [PLAT-14963]Clicking use same/diff replica while DR repair is throwing a permission error for a superadmin user
 b983d56 [PLAT-14595] Ability to change communication ports via edit universe
 a6ee050 [PLAT-13285] Make cloud provider edit retryable

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, tfoucher

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D37382
yusong-yan added a commit that referenced this issue Oct 3, 2024
…playing truncate operation"

Summary:
Following the changes in D37152, RocksDB could be shut down before the `TransactionLoader::Executor` object is destroyed. This may cause the tserver process to crash while table truncation is shutting down RocksDB, due to an unexpected reference count on the SuperVersion object. The crash is triggered by the following failed check:
```
column_family.cc:456] Check failed: is_last_reference
```
The root cause is that the `regular_iterator_` within TransactionLoader::Executor holds a reference to RocksDB's SuperVersion.

To resolve this issue, we propose resetting the regular and intent iterators before releasing the RocksDB scoped pending operation counter lock. It is safe to reset the iterators since they are not used afterwards.
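The ordering requirement can be illustrated with a toy reference-counting model. RocksDB manages SuperVersion references with its own mechanism; the sketch below substitutes `std::shared_ptr` purely for illustration, and `ToySuperVersion`, `ToyIterator`, `ToyDb` and `ShutdownCheckPasses` are hypothetical names.

```cpp
#include <cassert>
#include <memory>

// Stand-in for the version object that live iterators keep pinned.
struct ToySuperVersion {};

// An iterator holds a reference to the version for its whole lifetime.
struct ToyIterator {
  std::shared_ptr<ToySuperVersion> pinned;
};

struct ToyDb {
  std::shared_ptr<ToySuperVersion> super_version =
      std::make_shared<ToySuperVersion>();

  std::unique_ptr<ToyIterator> NewIterator() {
    return std::make_unique<ToyIterator>(ToyIterator{super_version});
  }

  // Mirrors the spirit of the failed "is_last_reference" check:
  // shutdown expects the DB to own the only remaining reference.
  bool IsLastReference() const { return super_version.use_count() == 1; }
};

// Returns whether the shutdown-time check would pass at the point where
// the loader gives up its pending-operation handle.
bool ShutdownCheckPasses(bool reset_iterators_first) {
  ToyDb db;
  std::unique_ptr<ToyIterator> regular_iterator = db.NewIterator();
  std::unique_ptr<ToyIterator> intents_iterator = db.NewIterator();
  if (reset_iterators_first) {
    // The follow-up fix: drop iterator references before the release.
    regular_iterator.reset();
    intents_iterator.reset();
  }
  // Here the op counter would be released, allowing truncate to shut
  // the DB down while this scope is still alive.
  return db.IsLastReference();
}
```

In this model the check passes only when the iterators are reset first, which matches the ordering proposed above.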
Jira: DB-12175

Test Plan: ./yb_build.sh --cxx-test pgwrapper_pg_single_tserver-test --gtest_filter PgSingleTServerTest.BootstrapReplayTruncate

Reviewers: rthallam, asrivastava, timur, bkolagani, sergei

Reviewed By: bkolagani, sergei

Subscribers: ybase, yql, mbautin

Differential Revision: https://phorge.dev.yugabyte.com/D38568
timothy-e pushed a commit that referenced this issue Oct 3, 2024
Summary:
 56461a4 [#23243] docdb : Followup fix for "Fix tablet bootstrap stuck when replaying truncate operation"
 46f9717 [DOC-492] Added TA-CL-23623: Upgrade failure from v2.20 to v2024.1 series (#24257)
 603d4ab Revert "[PLAT-15112] Handle incomplete pexvenv generation"
 4c37ae0 format change log (#24259)

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: tfoucher, fizaa, telgersma

Differential Revision: https://phorge.dev.yugabyte.com/D38687
yusong-yan added a commit that referenced this issue Oct 22, 2024
…rap stuck when replaying truncate operation"

Summary:
Original commit: 56461a4 / D38568
Following the changes in D37152, RocksDB could be shut down before the `TransactionLoader::Executor` object is destroyed. This may cause the tserver process to crash while table truncation is shutting down RocksDB, due to an unexpected reference count on the SuperVersion object. The crash is triggered by the following failed check:
```
column_family.cc:456] Check failed: is_last_reference
```
The root cause is that the `regular_iterator_` within TransactionLoader::Executor holds a reference to RocksDB's SuperVersion.

To resolve this issue, we propose resetting the regular and intent iterators before releasing the RocksDB scoped pending operation counter lock. It is safe to reset the iterators since they are not used afterwards.
Jira: DB-12175

Test Plan: ./yb_build.sh --cxx-test pgwrapper_pg_single_tserver-test --gtest_filter PgSingleTServerTest.BootstrapReplayTruncate

Reviewers: rthallam, asrivastava, timur, bkolagani, sergei

Reviewed By: bkolagani

Subscribers: mbautin, yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38898
yusong-yan added a commit that referenced this issue Oct 23, 2024
…ing truncate operation

Summary:
Original commit: c8cbcbf / D37152
**Issue:**
Tablet bootstrap can run into a deadlock if it needs to replay a truncate operation. Here is the sequence of events leading to the deadlock:

| Thread 1 (Main bootstrap thread) | Thread 2 (Load Transactions) |
| --- | --- |
| 1. Begin tablet bootstrap. | |
| 2. During OpenTablet, if transactions are enabled (a YSQL table, or a YCQL table with transactions enabled), acquire the `start_latch` by setting its value to 1; another thread (Thread 2) is created for the transaction load. | |
| | 3. Execute the transaction load; acquire `pending_op_counter_blocking_rocksdb_shutdown_start_` to prevent RocksDB shutdown. |
| 4. Replay the tablet truncate operation; wait for `pending_op_counter_blocking_rocksdb_shutdown_start_` to be released in order to shut down RocksDB. | |
| | 5. Transaction load completes. |
| | 6. Call `LoadFinished`, which starts waiting for `start_latch` to reach 0. |
| 7. Bootstrap completes; release the `start_latch` by setting its value to 0. | |
| | 8. Release `pending_op_counter_blocking_rocksdb_shutdown_start_`. |

Thread 1 is stuck at step 4, waiting for step 8 to be executed.
Thread 2 is stuck at step 6, waiting for step 7 to be executed.

Result: Tablet bootstrap gets stuck replaying the truncate operation. This issue was introduced by D29000 (commit id: 5159eb3), which changed the code to destroy the executor instance (which holds the operation counter) only after FinishLoad.

**Fix:**
Reset `pending_op_counter_blocking_rocksdb_shutdown_start_` before calling `loader_.FinishLoad(status)`. Also reset both `regular_iterator` and `intent_iterator`, as they hold references to RocksDB's SuperVersion. This is not a clean fix, but it is safe: FinishLoad doesn't need protection from the op counter, because it acquires its own op counter when processing the pending applies.

**Affected Version 2.20.1**
Starting from D29000 (commit id: 5159eb3), tablet bootstrap will get stuck when replaying a truncate operation.
Jira: DB-12175

Test Plan: ./yb_build.sh --cxx-test pgwrapper_pg_single_tserver-test --gtest_filter PgSingleTServerTest.BootstrapReplayTruncate

Reviewers: bkolagani, timur, sergei, mbautin, rthallam

Reviewed By: rthallam

Subscribers: ybase, rthallam, slingam, yql, mbautin

Differential Revision: https://phorge.dev.yugabyte.com/D37653
yusong-yan added a commit that referenced this issue Oct 24, 2024
…aying truncate operation

Summary:
Original commit: c8cbcbf / D37152
**Issue:**
Tablet bootstrap can run into a deadlock if it needs to replay a truncate operation. Here is the sequence of events leading to the deadlock:

| Thread 1 (Main bootstrap thread) | Thread 2 (Load Transactions) |
| --- | --- |
| 1. Begin tablet bootstrap. | |
| 2. During OpenTablet, if transactions are enabled (a YSQL table, or a YCQL table with transactions enabled), acquire the `start_latch` by setting its value to 1; another thread (Thread 2) is created for the transaction load. | |
| | 3. Execute the transaction load; acquire `pending_op_counter_blocking_rocksdb_shutdown_start_` to prevent RocksDB shutdown. |
| 4. Replay the tablet truncate operation; wait for `pending_op_counter_blocking_rocksdb_shutdown_start_` to be released in order to shut down RocksDB. | |
| | 5. Transaction load completes. |
| | 6. Call `LoadFinished`, which starts waiting for `start_latch` to reach 0. |
| 7. Bootstrap completes; release the `start_latch` by setting its value to 0. | |
| | 8. Release `pending_op_counter_blocking_rocksdb_shutdown_start_`. |

Thread 1 is stuck at step 4, waiting for step 8 to be executed.
Thread 2 is stuck at step 6, waiting for step 7 to be executed.

Result: Tablet bootstrap gets stuck replaying the truncate operation. This issue was introduced by D29000 (commit id: 5159eb3), which changed the code to destroy the executor instance (which holds the operation counter) only after FinishLoad.

**Fix:**
Reset `pending_op_counter_blocking_rocksdb_shutdown_start_` before calling `loader_.FinishLoad(status)`. Also reset both `regular_iterator` and `intent_iterator`, as they hold references to RocksDB's SuperVersion. This is not a clean fix, but it is safe: FinishLoad doesn't need protection from the op counter, because it acquires its own op counter when processing the pending applies.

**Affected Version 2.20.1**
Starting from D29000 (commit id: 5159eb3), tablet bootstrap will get stuck when replaying a truncate operation.
Jira: DB-12175

Test Plan: ./yb_build.sh --cxx-test pgwrapper_pg_single_tserver-test --gtest_filter PgSingleTServerTest.BootstrapReplayTruncate

Reviewers: bkolagani, timur, sergei, mbautin, rthallam

Reviewed By: rthallam

Subscribers: ybase, rthallam, slingam, yql, mbautin

Differential Revision: https://phorge.dev.yugabyte.com/D37652