[DocDB] Tablet Bootstrap Hangs Due to Truncate #23243
Labels: 2.20 Backport Required, 2024.1 Backport Required, area/docdb, kind/bug, priority/high
Comments
yusong-yan added the area/docdb and status/awaiting-triage labels on Jul 18, 2024
yugabyte-ci added the kind/bug and priority/medium labels on Jul 18, 2024
Here is where the executor is destroyed.
`start_latch_.Wait()` is released only after tablet bootstrap finishes.
yugabyte-ci added the priority/high label and removed the priority/medium label on Aug 1, 2024
yusong-yan added a commit that referenced this issue on Aug 17, 2024
…eration

Summary:
**Issue:** Tablet bootstrap can deadlock if it needs to replay a truncate operation. The sequence of events leading to the deadlock is:

| Thread 1 (Main bootstrap thread) | Thread 2 (Load Transactions) |
| --- | --- |
| 1. Begin tablet bootstrap | |
| 2. During OpenTablet, if transactions are enabled (YSQL table, or YCQL table with transactions enabled), acquire the `start_latch` by setting its value to 1, and create another thread (Thread 2) for transaction load | |
| | 3. Execute transaction load, acquiring `pending_op_counter_blocking_rocksdb_shutdown_start_` to prevent RocksDB shutdown |
| 4. Replay the tablet truncate operation, waiting for `pending_op_counter_blocking_rocksdb_shutdown_start_` to be released in order to shut down RocksDB | |
| | 5. Transaction load completed |
| | 6. Call the `LoadFinished` function, which starts waiting for `start_latch` to be 0 |
| 7. Bootstrap complete, release the `start_latch` by setting its value to 0 | |
| | 8. Release the `pending_op_counter_blocking_rocksdb_shutdown_start_` |

Thread 1 is stuck at step 4, waiting for step 8 to be executed. Thread 2 is stuck at step 6, waiting for step 7 to be executed.

Result: Tablet bootstrap hangs while replaying the truncate operation.

This issue was introduced by D29000 (commit id: 5159eb3), which changed the code to destroy the executor instance (which holds the operation counter) only after FinishLoad.

**Fix:** Reset `pending_op_counter_blocking_rocksdb_shutdown_start_` before calling `loader_.FinishLoad(status)`. This is not a clean fix, but it is safe, because FinishLoad does not need protection from the op counter: it acquires its own op counter when processing the pending applies.

**Affected Version 2.20.1** Starting from D29000 (commit id: 5159eb3), tablet bootstrap gets stuck when replaying a truncate operation.

Jira: DB-12175
Test Plan: ./yb_build.sh --cxx-test pgwrapper_pg_single_tserver-test --gtest_filter PgSingleTServerTest.BootstrapReplayTruncate
Reviewers: bkolagani, timur, sergei, mbautin, rthallam
Reviewed By: sergei, mbautin, rthallam
Subscribers: mbautin, yql, slingam, rthallam, ybase
Differential Revision: https://phorge.dev.yugabyte.com/D37152
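The ordering problem described in this commit can be sketched with a small standalone model. This is only an illustration, not YugabyteDB code: the `Latch` class and the `TruncateReplayProceeds` function are hypothetical, the two latches merely stand in for `start_latch` and `pending_op_counter_blocking_rocksdb_shutdown_start_`, and the hang is detected with a timeout so that the model always terminates.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical minimal count-down latch (not YugabyteDB's own latch class).
class Latch {
 public:
  explicit Latch(int count) : count_(count) { assert(count >= 0); }

  void CountDown() {
    std::lock_guard<std::mutex> lock(mu_);
    if (count_ > 0 && --count_ == 0) cv_.notify_all();
  }

  void Wait() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return count_ == 0; });
  }

  // Returns false if the count is still non-zero when the timeout expires.
  bool WaitFor(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mu_);
    return cv_.wait_for(lock, timeout, [this] { return count_ == 0; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int count_;
};

// Models the two threads from the table. Returns true if the truncate replay
// (step 4) could proceed, false if it would have hung.
bool TruncateReplayProceeds(bool release_counter_before_finish_load,
                            std::chrono::milliseconds timeout) {
  Latch start_latch(1);  // Step 2: set to 1 during OpenTablet.
  Latch pending_op(1);   // Step 3: op counter held by the loader thread.

  std::thread loader([&] {
    // Steps 3 and 5: the load runs and completes while holding pending_op.
    if (release_counter_before_finish_load) pending_op.CountDown();  // The fix.
    start_latch.Wait();                                              // Step 6.
    if (!release_counter_before_finish_load) pending_op.CountDown(); // Step 8.
  });

  // Step 4: truncate replay waits for the op counter to be released.
  const bool proceeded = pending_op.WaitFor(timeout);
  // Step 7. In the stuck case this is only reached because of the timeout;
  // it also lets the loader thread exit so the model can be joined.
  start_latch.CountDown();
  loader.join();
  return proceeded;
}
```

Calling `TruncateReplayProceeds(false, ...)` models the old ordering, where step 4 can only give up on the timeout; passing `true` models releasing the counter before the loader waits on the latch, which is the ordering the fix describes.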
jasonyb pushed a commit that referenced this issue on Aug 20, 2024
Summary:
b8cd4da Fix broken header links in Explore section (#23522)
cc43b2e [DEVOPS-3048] test automation: Backup unit tests use YBC by default
43652cc [PLAT-13836] Upgrading python setuptools
71f5eeb [doc][yba] Note on deleting KMS config (#23527)
a7061c6 [PLAT-14912] Adding replicated migrate guardrails for subdirectories
9b21783 [#23286] xCluster: Speedup setup for large table counts
6d4d8f6 [DEVOPS-3048] test automation: Fix ybc extraction
15fd362 [PLAT-14951] Add positive interger error message to pitr param step
c8cbcbf [#23243] docdb: Fix tablet bootstrap stuck when replaying truncate operation
5f286f5 [PLAT-14976] Make node agent silent parameters more obvious by showing in usage
68ac66e [#23492] DocDB: Upgrade and Rollback tests
Excluded: 516ead0 [#23304] xCluster: fix ysql_dump/Postgres so pg_class OIDs are preserved
Excluded: 16941de [#23304] fix Postgres so old dumps can be loaded that do not have pg_class OIDs
404075d [#23376] DocDB: Utilities needed for HNSW
875ccc1 [PLAT-14077] Update /get endpoint to support db scoped replication tables + metrics
71610b5 [#23536] fix test_macros.h to avoid problems with complaints about capturing variables
027f0e1 [#23493] xCluster: implement function for scanning sequences_data table
9103885 [#22462] DocDB: Enable pg_cron tests in TSAN
7e1f72c [PLAT-14963]Clicking use same/diff replica while DR repair is throwing a permission error for a superadmin user
b983d56 [PLAT-14595] Ability to change communication ports via edit universe
a6ee050 [PLAT-13285] Make cloud provider edit retryable

Test Plan: Jenkins: rebase: pg15-cherrypicks
Reviewers: jason, tfoucher
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D37382
yusong-yan added a commit that referenced this issue on Oct 3, 2024
…playing truncate operation"

Summary:
Following the changes in D37152, RocksDB could be shut down before the `TransactionLoader::Executor` object is destroyed. This may cause the tserver process to crash while a table truncation is shutting down RocksDB, due to an unexpected reference count on the SuperVersion object. The crash is triggered by the following failed check:

```
column_family.cc:456] Check failed: is_last_reference
```

The root cause is that the `regular_iterator_` within `TransactionLoader::Executor` holds references to RocksDB's SuperVersion. To resolve this issue, we propose resetting the regular and intent iterators before releasing the RocksDB scoped pending operation counter lock. It is safe to reset the iterators since they are not used afterwards.

Jira: DB-12175
Test Plan: ./yb_build.sh --cxx-test pgwrapper_pg_single_tserver-test --gtest_filter PgSingleTServerTest.BootstrapReplayTruncate
Reviewers: rthallam, asrivastava, timur, bkolagani, sergei
Reviewed By: bkolagani, sergei
Subscribers: ybase, yql, mbautin
Differential Revision: https://phorge.dev.yugabyte.com/D38568
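The release ordering this follow-up relies on can be sketched with reference counting. This is a minimal model under stated assumptions, not the actual RocksDB or YugabyteDB code: `SuperVersion`, `Db`, `Executor`, and `ShutdownIsSafe` are hypothetical stand-ins, `std::shared_ptr` replaces RocksDB's own reference counting, and the member name `intent_iterator_` is assumed for illustration.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for RocksDB's SuperVersion; only its refcount matters.
struct SuperVersion {};

// Hypothetical stand-in for the tablet's RocksDB instance.
struct Db {
  std::shared_ptr<SuperVersion> super_version = std::make_shared<SuperVersion>();
  int pending_ops = 0;

  // Shutdown may start once no operations are pending.
  bool ShutdownMayStart() const { return pending_ops == 0; }
  // Analogue of the is_last_reference check: only the DB still holds the object.
  bool IsLastReference() const { return super_version.use_count() == 1; }
};

// Hypothetical model of the loader's executor: it holds the op counter and
// two iterators, each of which pins the SuperVersion.
class Executor {
 public:
  explicit Executor(Db* db)
      : db_(db),
        regular_iterator_(db->super_version),
        intent_iterator_(db->super_version) {
    ++db_->pending_ops;
  }

  void ResetIterators() {
    regular_iterator_.reset();
    intent_iterator_.reset();
  }

  void ReleaseOpCounter() {
    assert(holds_op_counter_);
    holds_op_counter_ = false;
    --db_->pending_ops;
  }

 private:
  Db* db_;
  std::shared_ptr<SuperVersion> regular_iterator_;
  std::shared_ptr<SuperVersion> intent_iterator_;
  bool holds_op_counter_ = true;
};

// Returns true if, once shutdown is allowed to start, the DB holds the last
// SuperVersion reference (that is, the modeled check would pass).
bool ShutdownIsSafe(bool reset_iterators_first) {
  Db db;
  Executor executor(&db);
  if (reset_iterators_first) executor.ResetIterators();
  executor.ReleaseOpCounter();
  return db.ShutdownMayStart() && db.IsLastReference();
}
```

In this model, releasing the counter without resetting the iterators leaves extra references alive at the point where shutdown becomes possible, which corresponds to the failed `is_last_reference` check quoted above.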
timothy-e pushed a commit that referenced this issue on Oct 3, 2024
Summary:
56461a4 [#23243] docdb : Followup fix for "Fix tablet bootstrap stuck when replaying truncate operation"
46f9717 [DOC-492] Added TA-CL-23623: Upgrade failure from v2.20 to v2024.1 series (#24257)
603d4ab Revert "[PLAT-15112] Handle incomplete pexvenv generation"
4c37ae0 format change log (#24259)

Test Plan: Jenkins: rebase: pg15-cherrypicks
Reviewers: tfoucher, fizaa, telgersma
Differential Revision: https://phorge.dev.yugabyte.com/D38687
yusong-yan added a commit that referenced this issue on Oct 22, 2024
…rap stuck when replaying truncate operation"

Summary:
Original commit: 56461a4 / D38568

Following the changes in D37152, RocksDB could be shut down before the `TransactionLoader::Executor` object is destroyed. This may cause the tserver process to crash while a table truncation is shutting down RocksDB, due to an unexpected reference count on the SuperVersion object. The crash is triggered by the following failed check:

```
column_family.cc:456] Check failed: is_last_reference
```

The root cause is that the `regular_iterator_` within `TransactionLoader::Executor` holds references to RocksDB's SuperVersion. To resolve this issue, we propose resetting the regular and intent iterators before releasing the RocksDB scoped pending operation counter lock. It is safe to reset the iterators since they are not used afterwards.

Jira: DB-12175
Test Plan: ./yb_build.sh --cxx-test pgwrapper_pg_single_tserver-test --gtest_filter PgSingleTServerTest.BootstrapReplayTruncate
Reviewers: rthallam, asrivastava, timur, bkolagani, sergei
Reviewed By: bkolagani
Subscribers: mbautin, yql, ybase
Differential Revision: https://phorge.dev.yugabyte.com/D38898
yusong-yan added a commit that referenced this issue on Oct 23, 2024
…ing truncate operation

Summary:
Original commit: c8cbcbf / D37152

**Issue:** Tablet bootstrap can deadlock if it needs to replay a truncate operation. The sequence of events leading to the deadlock is:

| Thread 1 (Main bootstrap thread) | Thread 2 (Load Transactions) |
| --- | --- |
| 1. Begin tablet bootstrap | |
| 2. During OpenTablet, if transactions are enabled (YSQL table, or YCQL table with transactions enabled), acquire the `start_latch` by setting its value to 1, and create another thread (Thread 2) for transaction load | |
| | 3. Execute transaction load, acquiring `pending_op_counter_blocking_rocksdb_shutdown_start_` to prevent RocksDB shutdown |
| 4. Replay the tablet truncate operation, waiting for `pending_op_counter_blocking_rocksdb_shutdown_start_` to be released in order to shut down RocksDB | |
| | 5. Transaction load completed |
| | 6. Call the `LoadFinished` function, which starts waiting for `start_latch` to be 0 |
| 7. Bootstrap complete, release the `start_latch` by setting its value to 0 | |
| | 8. Release the `pending_op_counter_blocking_rocksdb_shutdown_start_` |

Thread 1 is stuck at step 4, waiting for step 8 to be executed. Thread 2 is stuck at step 6, waiting for step 7 to be executed.

Result: Tablet bootstrap hangs while replaying the truncate operation.

This issue was introduced by D29000 (commit id: 5159eb3), which changed the code to destroy the executor instance (which holds the operation counter) only after FinishLoad.

**Fix:** Reset `pending_op_counter_blocking_rocksdb_shutdown_start_` before calling `loader_.FinishLoad(status)`. Also, reset both regular_iterator and intent_iterator, as they hold refs to RocksDB's SuperVersion. This is not a clean fix, but it is safe, because FinishLoad does not need protection from the op counter: it acquires its own op counter when processing the pending applies.

**Affected Version 2.20.1** Starting from D29000 (commit id: 5159eb3), tablet bootstrap gets stuck when replaying a truncate operation.

Jira: DB-12175
Test Plan: ./yb_build.sh --cxx-test pgwrapper_pg_single_tserver-test --gtest_filter PgSingleTServerTest.BootstrapReplayTruncate
Reviewers: bkolagani, timur, sergei, mbautin, rthallam
Reviewed By: rthallam
Subscribers: ybase, rthallam, slingam, yql, mbautin
Differential Revision: https://phorge.dev.yugabyte.com/D37653
yusong-yan added a commit that referenced this issue on Oct 24, 2024
…aying truncate operation

Summary:
Original commit: c8cbcbf / D37152

**Issue:** Tablet bootstrap can deadlock if it needs to replay a truncate operation. The sequence of events leading to the deadlock is:

| Thread 1 (Main bootstrap thread) | Thread 2 (Load Transactions) |
| --- | --- |
| 1. Begin tablet bootstrap | |
| 2. During OpenTablet, if transactions are enabled (YSQL table, or YCQL table with transactions enabled), acquire the `start_latch` by setting its value to 1, and create another thread (Thread 2) for transaction load | |
| | 3. Execute transaction load, acquiring `pending_op_counter_blocking_rocksdb_shutdown_start_` to prevent RocksDB shutdown |
| 4. Replay the tablet truncate operation, waiting for `pending_op_counter_blocking_rocksdb_shutdown_start_` to be released in order to shut down RocksDB | |
| | 5. Transaction load completed |
| | 6. Call the `LoadFinished` function, which starts waiting for `start_latch` to be 0 |
| 7. Bootstrap complete, release the `start_latch` by setting its value to 0 | |
| | 8. Release the `pending_op_counter_blocking_rocksdb_shutdown_start_` |

Thread 1 is stuck at step 4, waiting for step 8 to be executed. Thread 2 is stuck at step 6, waiting for step 7 to be executed.

Result: Tablet bootstrap hangs while replaying the truncate operation.

This issue was introduced by D29000 (commit id: 5159eb3), which changed the code to destroy the executor instance (which holds the operation counter) only after FinishLoad.

**Fix:** Reset `pending_op_counter_blocking_rocksdb_shutdown_start_` before calling `loader_.FinishLoad(status)`. Also, reset both regular_iterator and intent_iterator, as they hold refs to RocksDB's SuperVersion. This is not a clean fix, but it is safe, because FinishLoad does not need protection from the op counter: it acquires its own op counter when processing the pending applies.

**Affected Version 2.20.1** Starting from D29000 (commit id: 5159eb3), tablet bootstrap gets stuck when replaying a truncate operation.

Jira: DB-12175
Test Plan: ./yb_build.sh --cxx-test pgwrapper_pg_single_tserver-test --gtest_filter PgSingleTServerTest.BootstrapReplayTruncate
Reviewers: bkolagani, timur, sergei, mbautin, rthallam
Reviewed By: rthallam
Subscribers: ybase, rthallam, slingam, yql, mbautin
Differential Revision: https://phorge.dev.yugabyte.com/D37652
Jira Link: DB-12175
Description
We observed tablet bootstrap getting stuck during truncate.
Most likely, it is waiting for `TransactionLoader::Executor` to release its ScopedRWOperation, which only happens after bootstrap completes.
Issue Type
kind/bug