[CDCSDK] Mutex lock error in CDCStreamLoader leading to master crash loop #23278

Closed
1 task done
shamanthchandra-yb opened this issue Jul 24, 2024 · 0 comments
Labels
2.20.6_blocker area/cdcsdk CDC SDK kind/bug This issue is a bug priority/high High Priority


shamanthchandra-yb commented Jul 24, 2024

Jira Link: DB-12205

Description

Testname: test_cdc_main_without_tablet_splitting

* thread #1, name = 'yb-master', stop reason = signal SIGSEGV
  * frame #0: 0x00007fa5c14f2a84 libpthread.so.0`__pthread_mutex_lock + 4
    frame #1: 0x0000563ae963cbe8 yb-master`yb::RWCLock::ReadLock() [inlined] yb::Mutex::Acquire(this=0x0000000000000008) at mutex.cc:85:12
    frame #2: 0x0000563ae963cbe3 yb-master`yb::RWCLock::ReadLock() [inlined] yb::MutexLock::MutexLock(this=<unavailable>, lock=0x0000000000000008) at mutex.h:107:12
    frame #3: 0x0000563ae963cbe3 yb-master`yb::RWCLock::ReadLock(this=<unavailable>) at rwc_lock.cc:80:13
    frame #4: 0x0000563ae8837369 yb-master`yb::master::CDCStreamLoader::Visit(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, yb::master::SysCDCStreamEntryPB const&) [inlined] yb::CowObject<yb::master::PersistentTableInfo>::ReadLock(this=0x0000000000000008) const at cow_object.h:59:11
    frame #5: 0x0000563ae8837361 yb-master`yb::master::CDCStreamLoader::Visit(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, yb::master::SysCDCStreamEntryPB const&) [inlined] yb::CowReadLock<yb::master::PersistentTableInfo>::CowReadLock(this=0x00007fa5ac9b7d60, cow=0x0000000000000008) at cow_object.h:171:11
    frame #6: 0x0000563ae883735a yb-master`yb::master::CDCStreamLoader::Visit(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, yb::master::SysCDCStreamEntryPB const&) [inlined] yb::master::MetadataCowWrapper<yb::master::PersistentTableInfo>::LockForRead(this=0x0000000000000000) const at catalog_entity_base.h:84:41
    frame #7: 0x0000563ae8837356 yb-master`yb::master::CDCStreamLoader::Visit(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, yb::master::SysCDCStreamEntryPB const&) [inlined] yb::master::TableInfo::GetSchema(this=0x0000000000000000, schema=0x00007fa5ac9b7c00) const at catalog_entity_info.cc:463:23
    frame #8: 0x0000563ae8837356 yb-master`yb::master::CDCStreamLoader::Visit(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, yb::master::SysCDCStreamEntryPB const&) at xrepl_catalog_manager.cc:1809:35
    frame #9: 0x0000563ae88372f8 yb-master`yb::master::CDCStreamLoader::Visit(this=<unavailable>, stream_id_str=<unavailable>, metadata=0x00007fa5ac9b7fa0) at xrepl_catalog_manager.cc:282:27
    frame #10: 0x0000563ae8835985 yb-master`yb::master::Visitor<yb::master::PersistentCDCStreamInfo>::Visit(this=0x000034d2bb0939a0, id=(begin_ = "0193479d43ba53a3a743e1e644758c40", end_ = 0x0000000000000000), data=<unavailable>) at sys_catalog-internal.h:57:12
    frame #11: 0x0000563ae874f93f yb-master`std::__1::__function::__func<yb::master::SysCatalogTable::Visit(yb::master::VisitorBase*)::$_0, std::__1::allocator<yb::master::SysCatalogTable::Visit(yb::master::VisitorBase*)::$_0>, yb::Status (yb::Slice const&, yb::Slice const&)>::operator()(yb::Slice const&, yb::Slice const&) [inlined] yb::master::SysCatalogTable::Visit(this=<unavailable>, id=<unavailable>, data=<unavailable>)::$_0::operator()(yb::Slice const&, yb::Slice const&) const at sys_catalog.cc:879:3
    frame #12: 0x0000563ae874f914 yb-master`std::__1::__function::__func<yb::master::SysCatalogTable::Visit(yb::master::VisitorBase*)::$_0, std::__1::allocator<yb::master::SysCatalogTable::Visit(yb::master::VisitorBase*)::$_0>, yb::Status (yb::Slice const&, yb::Slice const&)>::operator()(yb::Slice const&, yb::Slice const&) [inlined] decltype(__f=<unavailable>, __args=<unavailable>, __args=<unavailable>)::$_0&>()(std::declval<yb::Slice const&>(), std::declval<yb::Slice const&>())) std::__1::__invoke[abi:ue170006]<yb::master::SysCatalogTable::Visit(yb::master::VisitorBase*)::$_0&, yb::Slice const&, yb::Slice const&>(yb::master::SysCatalogTable::Visit(yb::master::VisitorBase*)::$_0&, yb::Slice const&, yb::Slice const&) at invoke.h:340:25
    frame #13: 0x0000563ae874f914 yb-master`std::__1::__function::__func<yb::master::SysCatalogTable::Visit(yb::master::VisitorBase*)::$_0, std::__1::allocator<yb::master::SysCatalogTable::Visit(yb::master::VisitorBase*)::$_0>, yb::Status (yb::Slice const&, yb::Slice const&)>::operator()(yb::Slice const&, yb::Slice const&) [inlined] yb::Status std::__1::__invoke_void_return_wrapper<yb::Status, false>::__call[abi:ue170006]<yb::master::SysCatalogTable::Visit(__args=<unavailable>, __args=<unavailable>, __args=<unavailable>)::$_0&, yb::Slice const&, yb::Slice const&>(yb::master::SysCatalogTable::Visit(yb::master::VisitorBase*)::$_0&, yb::Slice const&, yb::Slice const&) at invoke.h:407:12
    frame #14: 0x0000563ae874f914 yb-master`std::__1::__function::__func<yb::master::SysCatalogTable::Visit(yb::master::VisitorBase*)::$_0, std::__1::allocator<yb::master::SysCatalogTable::Visit(yb::master::VisitorBase*)::$_0>, yb::Status (yb::Slice const&, yb::Slice const&)>::operator()(yb::Slice const&, yb::Slice const&) [inlined] std::__1::__function::__alloc_func<yb::master::SysCatalogTable::Visit(yb::master::VisitorBase*)::$_0, std::__1::allocator<yb::master::SysCatalogTable::Visit(yb::master::VisitorBase*)::$_0>, yb::Status (yb::Slice const&, yb::Slice const&)>::operator(this=<unavailable>, __arg=<unavailable>, __arg=<unavailable>)[abi:ue170006](yb::Slice const&, yb::Slice const&) at function.h:192:16
    frame #15: 0x0000563ae874f914 yb-master`std::__1::__function::__func<yb::master::SysCatalogTable::Visit(yb::master::VisitorBase*)::$_0, std::__1::allocator<yb::master::SysCatalogTable::Visit(yb::master::VisitorBase*)::$_0>, yb::Status (yb::Slice const&, yb::Slice const&)>::operator(this=<unavailable>, __arg=<unavailable>, __arg=<unavailable>)(yb::Slice const&, yb::Slice const&) at function.h:363:12
    frame #16: 0x0000563ae8764991 yb-master`yb::master::EnumerateSysCatalog(yb::docdb::DocRowwiseIterator*, yb::Schema const&, signed char, std::__1::function<yb::Status (yb::Slice const&, yb::Slice const&)> const&) [inlined] std::__1::__function::__value_func<yb::Status (yb::Slice const&, yb::Slice const&)>::operator(this=0x00007fa5ac9b8940, __args=0x00007fa5ac9b86e0, __args=0x00007fa5ac9b8748)[abi:ue170006](yb::Slice const&, yb::Slice const&) const at function.h:517:16
    frame #17: 0x0000563ae8764973 yb-master`yb::master::EnumerateSysCatalog(yb::docdb::DocRowwiseIterator*, yb::Schema const&, signed char, std::__1::function<yb::Status (yb::Slice const&, yb::Slice const&)> const&) [inlined] std::__1::function<yb::Status (yb::Slice const&, yb::Slice const&)>::operator(this=0x00007fa5ac9b8940, __arg=0x00007fa5ac9b86e0, __arg=0x00007fa5ac9b8748)(yb::Slice const&, yb::Slice const&) const at function.h:1168:12
    frame #18: 0x0000563ae8764973 yb-master`yb::master::EnumerateSysCatalog(yb::docdb::DocRowwiseIterator*, yb::Schema const&, signed char, std::__1::function<yb::Status (yb::Slice const&, yb::Slice const&)> const&) [inlined] yb::master::(anonymous namespace)::ReadNextSysCatalogRow(value_map=0x00007fa5ac9b8360, schema=0x000034d2bfc2e218, entry_type='\n', type_col_idx=0, entry_id_col_idx=1, metadata_col_idx=2, callback=0x00007fa5ac9b8940)> const&) at sys_catalog_writer.cc:78:10
    frame #19: 0x0000563ae8764829 yb-master`yb::master::EnumerateSysCatalog(doc_iter=0x000034d2bb4ed000, schema=0x000034d2bfc2e218, entry_type='\n', callback=0x00007fa5ac9b8940)> const&) at sys_catalog_writer.cc:228:5
    frame #20: 0x0000563ae87644e6 yb-master`yb::master::EnumerateSysCatalog(tablet=<unavailable>, schema=0x000034d2bfc2e218, entry_type='\n', callback=0x00007fa5ac9b8940)> const&) at sys_catalog_writer.cc:206:10
    frame #21: 0x0000563ae874d9bb yb-master`yb::master::SysCatalogTable::Visit(this=0x000034d2bf82d180, visitor=0x000034d2bb0939a0) at sys_catalog.cc:879:3
    frame #22: 0x0000563ae84a3156 yb-master`yb::master::CatalogManager::RunLoaders(yb::master::SysCatalogLoadingState*) at xrepl_catalog_manager.cc:320:3
    frame #23: 0x0000563ae84a3093 yb-master`yb::master::CatalogManager::RunLoaders(this=0x000034d2bf8e7600, state=0x00007fa5ac9b8e30) at catalog_manager.cc:1490:3
    frame #24: 0x0000563ae8498faa yb-master`yb::master::CatalogManager::VisitSysCatalog(this=0x000034d2bf8e7600, state=0x00007fa5ac9b8e30) at catalog_manager.cc:1325:5
    frame #25: 0x0000563ae84959c2 yb-master`yb::master::CatalogManager::LoadSysCatalogDataTask(this=0x000034d2bf8e7600) at catalog_manager.cc:1121:21
    frame #26: 0x0000563ae96669f5 yb-master`yb::ThreadPool::DispatchThread(this=0x000034d2bf82cc40, permanent=<unavailable>) at threadpool.cc:612:22
    frame #27: 0x0000563ae9662e03 yb-master`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator(this=0x000034d2be437da0)[abi:ue170006]() const at function.h:517:16
    frame #28: 0x0000563ae9662ded yb-master`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator(this=0x000034d2be437da0)() const at function.h:1168:12
    frame #29: 0x0000563ae9662ded yb-master`yb::Thread::SuperviseThread(arg=0x000034d2be437d40) at thread.cc:866:3
    frame #30: 0x00007fa5c14f01ca libpthread.so.0`start_thread + 234
    frame #31: 0x00007fa5c1741e73 libc.so.6`__clone + 67

I suspect this is related to the changes in #22773.

The Slack thread and the stress run link are available in the Jira ticket.

Source connector version

1.9.5.y.220.4

Connector configuration

add connector connector_name='ybconnector_cdc_b470fd_test_cdc_af0bbb' stream_id='362f7abff66371acd249e5dd87189486' db_name='cdc_b470fd' connector_host='172.151.22.251' table_list=['test_cdc_af0bbb']

    {'name': 'ybconnector_cdc_b470fd_test_cdc_af0bbb',
     'config': {
        'connector.class': 'io.debezium.connector.yugabytedb.YugabyteDBConnector',
        'database.hostname': '172.151.31.172:5433,172.151.28.34:5433,172.151.25.60:5433',
        'database.master.addresses': '172.151.31.172:7100,172.151.28.34:7100,172.151.25.60:7100',
        'database.port': 5433,
        'database.masterhost': '172.151.28.34',
        'database.masterport': '7100',
        'database.user': 'yugabyte',
        'database.password': 'yugabyte',
        'database.dbname': 'cdc_b470fd',
        'database.server.name': 'db_cdc',
        'database.streamid': '362f7abff66371acd249e5dd87189486',
        'snapshot.mode': 'initial',
        'admin.operation.timeout.ms': 600000,
        'socket.read.timeout.ms': 300000,
        'max.connector.retries': '10',
        'operation.timeout.ms': 600000,
        'topic.creation.default.compression.type': 'lz4',
        'topic.creation.default.cleanup.policy': 'delete',
        'topic.creation.default.partitions': 2,
        'topic.creation.default.replication.factor': '1',
        'tasks.max': '5',
        'table.include.list': 'public.test_cdc_af0bbb'}}

YugabyteDB version

2.23.0.0-b625

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@shamanthchandra-yb shamanthchandra-yb added priority/high High Priority area/cdcsdk CDC SDK status/awaiting-triage Issue awaiting triage labels Jul 24, 2024
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug and removed status/awaiting-triage Issue awaiting triage labels Jul 24, 2024
@yugabyte-ci yugabyte-ci changed the title [CDCSDK] Mutex lock error in CDCStreamLoader leading to tserver crash loop [CDCSDK] Mutex lock error in CDCStreamLoader leading to master crash loop Aug 5, 2024
siddharth2411 added a commit that referenced this issue Aug 5, 2024
…hile loading CDC stream

Summary:
When a table that is part of a CDC stream is dropped, a background thread removes it from the CDC stream metadata.
If the master restarts or master leadership changes before the background thread completes this cleanup, the dropped table is still listed in the stream metadata. In either scenario, while loading the CDC streams, we check every table in the CDC stream metadata for ineligibility. The table schema is one of the objects examined during this check, and to get it we fetch the `TableInfo` object from the master. Because the table has been dropped, this fetch returns a nullptr, and dereferencing it crashed the master.
Jira: DB-12205

Test Plan: ./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTablesCleanupWhenDropTableCleanupIsDisabled

Reviewers: hsunder, asrinivasan, stiwary, skumar

Reviewed By: skumar

Subscribers: ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D37053
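
The failure mode described in the summary above (a lookup that returns a null `TableInfo` for a dropped table, followed by a schema read) can be illustrated with a minimal sketch. This is not the actual patch from D37053: the `TableInfo` struct, `TableMap`, `FindTableOrNull`, and `ClassifyStreamTables` below are hypothetical stand-ins, assuming only that the loader looks up each table id from the stream metadata and must tolerate a missing entry.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the master's catalog entry; not the real yb::master::TableInfo.
struct TableInfo {
  std::string schema;
};

// Hypothetical in-memory table catalog keyed by table id.
using TableMap = std::unordered_map<std::string, std::shared_ptr<TableInfo>>;

// Returns the catalog entry for table_id, or nullptr if the table no longer exists.
std::shared_ptr<TableInfo> FindTableOrNull(const TableMap& tables,
                                           const std::string& table_id) {
  auto it = tables.find(table_id);
  if (it == tables.end()) {
    return nullptr;
  }
  return it->second;
}

struct LoadResult {
  std::vector<std::string> live_tables;     // tables whose schema could be read
  std::vector<std::string> dropped_tables;  // tables missing from the catalog
};

// Walks the table ids recorded in a stream's metadata and checks each lookup
// result before touching the schema, so a dropped table never causes a
// null dereference.
LoadResult ClassifyStreamTables(const TableMap& tables,
                                const std::vector<std::string>& stream_table_ids) {
  LoadResult result;
  for (const auto& id : stream_table_ids) {
    std::shared_ptr<TableInfo> table = FindTableOrNull(tables, id);
    if (!table) {
      // Dropped but not yet cleaned from the stream metadata: skip the schema check.
      result.dropped_tables.push_back(id);
      continue;
    }
    // Safe: table is non-null here, so reading the schema cannot crash.
    const std::string& schema = table->schema;
    (void)schema;
    result.live_tables.push_back(id);
  }
  return result;
}
```

With a catalog containing only `t1` and stream metadata listing `t1` and `t2`, the sketch reports `t1` as live and `t2` as dropped instead of crashing on the missing entry. How the real loader treats a dropped table (skipping it, or scheduling it for cleanup) is determined by the commit itself, not by this sketch.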
siddharth2411 added a commit that referenced this issue Aug 6, 2024
…with drop table while loading CDC stream

Summary:
**Backport Description:**
Resolved minor merge conflicts because some code has been refactored in latest master.

**Original Description:**
Original commit: 64e1bf8 / D37053
When a table that is part of a CDC stream is dropped, a background thread removes it from the CDC stream metadata.
If the master restarts or master leadership changes before the background thread completes this cleanup, the dropped table is still listed in the stream metadata. In either scenario, while loading the CDC streams, we check every table in the CDC stream metadata for ineligibility. The table schema is one of the objects examined during this check, and to get it we fetch the `TableInfo` object from the master. Because the table has been dropped, this fetch returns a nullptr, and dereferencing it crashed the master.
Jira: DB-12205

Test Plan: ./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTablesCleanupWhenDropTableCleanupIsDisabled

Reviewers: asrinivasan, stiwary, skumar

Reviewed By: stiwary

Subscribers: ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D37067
jasonyb pushed a commit that referenced this issue Aug 7, 2024
Summary:
 50931bf [#23273] yugabyted: Fix `yugabyted configure_read_replica` commands.
 64e1bf8 [#23278] CDCSDK: Handle non-eligible tables cleanup with drop table while loading CDC stream
 ce80f7a [#13358] YSQL: Refactor DDL Atomicity Stress Test
 Excluded: 6d40d27 [#23407] YSQL: clean up compound BNL logic
 5cb74a7 [PLAT-14164] New Alert for clock drift
 f39c76c [PLAT-14800] Fix yb.allow_db_version_more_than_yba_version being insufficient for YBA/DB version checks
 a42549e [#23377] DocDB: Implement the way to apply vector index updates to DocDB
 3923ec5 [PLAT-14749][Platform]Add a warning message to image upgrade dialog
 709cd92 [PLAT-14848] postgres.service file did not have RestartSec filled out
 da10672 [#23069] docdb: implemented per-iterator readahead for sequential reads
 f439c8a [PLAT-14852]: Do not raise error when JWT_JWKS_URL has valid value and JWT has empty keyset

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, tfoucher

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D37095
siddharth2411 added a commit that referenced this issue Aug 8, 2024
…with drop table while loading CDC stream

Summary:
**Backport description:**
Resolved minor merge conflicts in the test's base class caused by a missing flag.

**Original description:**
Original commit: 64e1bf8 / D37053
When a table that is part of a CDC stream is dropped, a background thread removes it from the CDC stream metadata.
If the master restarts or master leadership changes before the background thread completes this cleanup, the dropped table is still listed in the stream metadata. In either scenario, while loading the CDC streams, we check every table in the CDC stream metadata for ineligibility. The table schema is one of the objects examined during this check, and to get it we fetch the `TableInfo` object from the master. Because the table has been dropped, this fetch returns a nullptr, and dereferencing it crashed the master.
Jira: DB-12205

Test Plan: ./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTablesCleanupWhenDropTableCleanupIsDisabled

Reviewers: asrinivasan, stiwary, skumar

Reviewed By: stiwary

Subscribers: ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D37090
siddharth2411 added a commit that referenced this issue Aug 14, 2024
…th drop table while loading CDC stream

Summary:
**Backport description:**
Resolved minor merge conflicts because some code has been refactored in latest master.

**Original description:**
Original commit: 64e1bf8 / D37053
When a table that is part of a CDC stream is dropped, a background thread removes it from the CDC stream metadata.
If the master restarts or master leadership changes before the background thread completes this cleanup, the dropped table is still listed in the stream metadata. In either scenario, while loading the CDC streams, we check every table in the CDC stream metadata for ineligibility. The table schema is one of the objects examined during this check, and to get it we fetch the `TableInfo` object from the master. Because the table has been dropped, this fetch returns a nullptr, and dereferencing it crashed the master.
Jira: DB-12205

Test Plan: ./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTablesCleanupWhenDropTableCleanupIsDisabled

Reviewers: asrinivasan, stiwary, skumar

Reviewed By: stiwary

Subscribers: ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D37091