Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is it stable ? #3

Closed
lakinmohapatra opened this issue Nov 8, 2017 · 4 comments
Closed

Is it stable ? #3

lakinmohapatra opened this issue Nov 8, 2017 · 4 comments

Comments

@lakinmohapatra
Copy link

Is it stable ?

@rkarthik007
Copy link
Collaborator

rkarthik007 commented Nov 8, 2017

UPDATE (as of May 2018)

YugaByte DB 1.0 is now generally available and ready for production workloads! Multiple customers are using it to power their business-critical use-cases.

If you have any questions, you can:

===
ORIGINAL:

Yes, YugaByte is stable. We run a wide variety of tests before publishing our releases, including large cluster and stress tests. We are hoping to write a blog post soon on how rigorous our test coverage is.

However, note that we are a 0.9 beta release at the moment - and we have some customers running us over a long time period in a pre-production mode. We will be generally available in a few months time.

Please reach out to us (https://forum.yugabyte.com/) if you are considering us for a use-case. We would be very happy to help and give you more details relevant to your use-case.

@yingfeng
Copy link

Have you passed the jepsen test?

@amitanandaiyer
Copy link
Contributor

amitanandaiyer commented Nov 13, 2017

@yingfeng We are actively working on using Jepsen to validate the YugaByte DB. Please follow the thread on https://forum.yugabyte.com/t/validating-yugabyte-db-with-jepsen to keep up with the updates regarding this.

@schoudhury
Copy link
Contributor

closing this issue for now, pls open a new issue when needed

yugabyte-ci pushed a commit that referenced this issue Feb 2, 2018
… memtable

Summary:
There was a crash during one of our performance integration tests that was caused by Frontiers() not being set on a memtable. That could only possibly happen if the memtable is empty, and it is still not clear how an empty memtable could get into the list of immutable memtables. Regardless of that, instead of crashing, we should just flush that memtable and log an error message.

```
#0  operator() (memtable=..., __closure=0x7f2e454b67b0) at ../../../../../src/yb/tablet/tablet_peer.cc:178
#1  std::_Function_handler<bool(const rocksdb::MemTable&), yb::tablet::TabletPeer::InitTabletPeer(const std::shared_ptr<yb::tablet::enterprise::Tablet>&, const std::shared_future<std::shared_ptr<yb::client::YBClient> >&, const scoped_refptr<yb::server::Clock>&, const std::shared_ptr<yb::rpc::Messenger>&, const scoped_refptr<yb::log::Log>&, const scoped_refptr<yb::MetricEntity>&, yb::ThreadPool*)::<lambda()>::<lambda(const rocksdb::MemTable&)> >::_M_invoke(const std::_Any_data &, const rocksdb::MemTable &) (__functor=..., __args#0=...)  at /n/jenkins/linuxbrew/linuxbrew_2018-01-09T08_28_02/Cellar/gcc/5.5.0/include/c++/5.5.0/functional:1857
#2  0x00007f2f7346a70e in operator() (__args#0=..., this=0x7f2e454b67b0) at /n/jenkins/linuxbrew/linuxbrew_2018-01-09T08_28_02/Cellar/gcc/5.5.0/include/c++/5.5.0/functional:2267
#3  rocksdb::MemTableList::PickMemtablesToFlush(rocksdb::autovector<rocksdb::MemTable*, 8ul>*, std::function<bool (rocksdb::MemTable const&)> const&) (this=0x7d02978, ret=ret@entry=0x7f2e454b6370, filter=...)
    at ../../../../../src/yb/rocksdb/db/memtable_list.cc:259
#4  0x00007f2f7345517f in rocksdb::FlushJob::Run (this=this@entry=0x7f2e454b6750, file_meta=file_meta@entry=0x7f2e454b68d0) at ../../../../../src/yb/rocksdb/db/flush_job.cc:143
#5  0x00007f2f7341b7c3 in rocksdb::DBImpl::FlushMemTableToOutputFile (this=this@entry=0x89d2400, cfd=cfd@entry=0x7d02300, mutable_cf_options=..., made_progress=made_progress@entry=0x7f2e454b709e,
    job_context=job_context@entry=0x7f2e454b70b0, log_buffer=0x7f2e454b7280) at ../../../../../src/yb/rocksdb/db/db_impl.cc:1586
#6  0x00007f2f7341c19f in rocksdb::DBImpl::BackgroundFlush (this=this@entry=0x89d2400, made_progress=made_progress@entry=0x7f2e454b709e, job_context=job_context@entry=0x7f2e454b70b0,
    log_buffer=log_buffer@entry=0x7f2e454b7280) at ../../../../../src/yb/rocksdb/db/db_impl.cc:2816
#7  0x00007f2f7342539b in rocksdb::DBImpl::BackgroundCallFlush (this=0x89d2400) at ../../../../../src/yb/rocksdb/db/db_impl.cc:2838
#8  0x00007f2f735154c3 in rocksdb::ThreadPool::BGThread (this=0x3b0bb20, thread_id=0) at ../../../../../src/yb/rocksdb/util/thread_posix.cc:133
#9  0x00007f2f73515558 in rocksdb::BGThreadWrapper (arg=0xd970a20) at ../../../../../src/yb/rocksdb/util/thread_posix.cc:157
#10 0x00007f2f6c964694 in start_thread (arg=0x7f2e454b8700) at pthread_create.c:333
```

Test Plan: Jenkins

Reviewers: hector, sergei

Reviewed By: hector, sergei

Subscribers: sergei, bogdan, bharat, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D4044
yugabyte-ci pushed a commit that referenced this issue Nov 30, 2018
…EANUP

Summary:
We were failing to check the return code of the function `LookupTablePeerOrRespond` when CLEANUP request is received by tablet service.
This was causing the following FATAL right after restart during software upgrade on a cluster with SecondaryIndex workload.

```#0  yb::tserver::TabletServiceImpl::CheckMemoryPressure<yb::tserver::UpdateTransactionResponsePB> (this=this@entry=0x24c2e00, tablet=tablet@entry=0x0,
    resp=resp@entry=0x14d3d410, context=context@entry=0x7f55b1eb5600) at ../../src/yb/tserver/tablet_service.cc:222
#1  0x00007f55d4c8a881 in yb::tserver::TabletServiceImpl::UpdateTransaction (this=this@entry=0x24c2e00, req=req@entry=0x1057aa90, resp=resp@entry=0x14d3d410, context=...)
    at ../../src/yb/tserver/tablet_service.cc:431
#2  0x00007f55d273f28a in yb::tserver::TabletServerServiceIf::Handle (this=0x24c2e00, call=...) at src/yb/tserver/tserver_service.service.cc:267
#3  0x00007f55cff0a3ea in yb::rpc::ServicePoolImpl::Handle (this=0x27ca540, incoming=...) at ../../src/yb/rpc/service_pool.cc:214```

Changed LookupTablePeerOrRespond to return complete result using return value.

Test Plan: Update xdc-user-identity and check that is does not crash and workload is stable.

Reviewers: robert, hector, mikhail, kannan

Reviewed By: mikhail, kannan

Subscribers: kannan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D5772
yugabyte-ci pushed a commit that referenced this issue Jun 27, 2019
…data

Summary:
Originally issue has been discovered as `RaftConsensusITest.TestAddRemoveVoter` random test failure in TSAN mode due to a data race.
```
WARNING: ThreadSanitizer: data race (pid=11050)
1663	[ts-2]	   Write of size 8 at 0x7b4c000603a8 by thread T51 (mutexes: write M3613):
...
1674	[ts-2]	     #10 yb::tablet::KvStoreInfo::LoadTablesFromPB(google::protobuf::RepeatedPtrField<yb::tablet::TableInfoPB>, string) src/yb/tablet/tablet_metadata.cc:170
1675	[ts-2]	     #11 yb::tablet::KvStoreInfo::LoadFromPB(yb::tablet::KvStoreInfoPB const&, string) src/yb/tablet/tablet_metadata.cc:189:10
1676	[ts-2]	     #12 yb::tablet::RaftGroupMetadata::LoadFromSuperBlock(yb::tablet::RaftGroupReplicaSuperBlockPB const&) src/yb/tablet/tablet_metadata.cc:508:5
1677	[ts-2]	     #13 yb::tablet::RaftGroupMetadata::ReplaceSuperBlock(yb::tablet::RaftGroupReplicaSuperBlockPB const&) src/yb/tablet/tablet_metadata.cc:545:3
1678	[ts-2]	     #14 yb::tserver::RemoteBootstrapClient::Finish() src/yb/tserver/remote_bootstrap_client.cc:486:3
...
   Previous read of size 4 at 0x7b4c000603a8 by thread T16:
1697	[ts-2]	     #0 yb::tablet::RaftGroupMetadata::schema_version() const src/yb/tablet/tablet_metadata.h:251:34
1698	[ts-2]	     #1 yb::tserver::TSTabletManager::CreateReportedTabletPB(std::__1::shared_ptr<yb::tablet::TabletPeer> const&, yb::master::ReportedTabletPB*) src/yb/tserver/ts_tablet_manager.cc:1323:71
1699	[ts-2]	     #2 yb::tserver::TSTabletManager::GenerateIncrementalTabletReport(yb::master::TabletReportPB*) src/yb/tserver/ts_tablet_manager.cc:1359:5
1700	[ts-2]	     #3 yb::tserver::Heartbeater::Thread::TryHeartbeat() src/yb/tserver/heartbeater.cc:371:32
1701	[ts-2]	     #4 yb::tserver::Heartbeater::Thread::DoHeartbeat() src/yb/tserver/heartbeater.cc:531:19
```

The reason is that although `RaftGroupMetadata::schema_version()` is getting `TableInfo` pointer from `primary_table_info()` under mutex lock, but then it accesses its field without lock.

Added `RaftGroupMetadata::primary_table_info_guarded()` private method which returns a pair of `TableInfo*` and `std::unique_lock` and used it in `RaftGroupMetadata::schema_version()` and other `RaftGroupMetadata` functions accessing primary table info fields.

Test Plan: `ybd tsan --sj --cxx-test integration-tests_raft_consensus-itest --gtest_filter RaftConsensusITest.TestAddRemoveVoter -n 1000`

Reviewers: bogdan, sergei

Reviewed By: sergei

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D6813
mbautin added a commit that referenced this issue Jul 11, 2019
…ed to the

earlier commit 864e72b

Original commit message:

ENG-2793 Do not fail when deciding if we can flush an empty immutable memtable

Summary:
There was a crash during one of our performance integration tests that was caused by Frontiers() not being set on a memtable. That could only possibly happen if the memtable is empty, and it is still not clear how an empty memtable could get into the list of immutable memtables. Regardless of that, instead of crashing, we should just flush that memtable and log an error message.

```
#0  operator() (memtable=..., __closure=0x7f2e454b67b0) at ../../../../../src/yb/tablet/tablet_peer.cc:178
#1  std::_Function_handler<bool(const rocksdb::MemTable&), yb::tablet::TabletPeer::InitTabletPeer(const std::shared_ptr<yb::tablet::enterprise::Tablet>&, const std::shared_future<std::shared_ptr<yb::client::YBClient> >&, const scoped_refptr<yb::server::Clock>&, const std::shared_ptr<yb::rpc::Messenger>&, const scoped_refptr<yb::log::Log>&, const scoped_refptr<yb::MetricEntity>&, yb::ThreadPool*)::<lambda()>::<lambda(const rocksdb::MemTable&)> >::_M_invoke(const std::_Any_data &, const rocksdb::MemTable &) (__functor=..., __args#0=...)  at /n/jenkins/linuxbrew/linuxbrew_2018-01-09T08_28_02/Cellar/gcc/5.5.0/include/c++/5.5.0/functional:1857
#2  0x00007f2f7346a70e in operator() (__args#0=..., this=0x7f2e454b67b0) at /n/jenkins/linuxbrew/linuxbrew_2018-01-09T08_28_02/Cellar/gcc/5.5.0/include/c++/5.5.0/functional:2267
#3  rocksdb::MemTableList::PickMemtablesToFlush(rocksdb::autovector<rocksdb::MemTable*, 8ul>*, std::function<bool (rocksdb::MemTable const&)> const&) (this=0x7d02978, ret=ret@entry=0x7f2e454b6370, filter=...)
    at ../../../../../src/yb/rocksdb/db/memtable_list.cc:259
#4  0x00007f2f7345517f in rocksdb::FlushJob::Run (this=this@entry=0x7f2e454b6750, file_meta=file_meta@entry=0x7f2e454b68d0) at ../../../../../src/yb/rocksdb/db/flush_job.cc:143
#5  0x00007f2f7341b7c3 in rocksdb::DBImpl::FlushMemTableToOutputFile (this=this@entry=0x89d2400, cfd=cfd@entry=0x7d02300, mutable_cf_options=..., made_progress=made_progress@entry=0x7f2e454b709e,
    job_context=job_context@entry=0x7f2e454b70b0, log_buffer=0x7f2e454b7280) at ../../../../../src/yb/rocksdb/db/db_impl.cc:1586
#6  0x00007f2f7341c19f in rocksdb::DBImpl::BackgroundFlush (this=this@entry=0x89d2400, made_progress=made_progress@entry=0x7f2e454b709e, job_context=job_context@entry=0x7f2e454b70b0,
    log_buffer=log_buffer@entry=0x7f2e454b7280) at ../../../../../src/yb/rocksdb/db/db_impl.cc:2816
#7  0x00007f2f7342539b in rocksdb::DBImpl::BackgroundCallFlush (this=0x89d2400) at ../../../../../src/yb/rocksdb/db/db_impl.cc:2838
#8  0x00007f2f735154c3 in rocksdb::ThreadPool::BGThread (this=0x3b0bb20, thread_id=0) at ../../../../../src/yb/rocksdb/util/thread_posix.cc:133
#9  0x00007f2f73515558 in rocksdb::BGThreadWrapper (arg=0xd970a20) at ../../../../../src/yb/rocksdb/util/thread_posix.cc:157
#10 0x00007f2f6c964694 in start_thread (arg=0x7f2e454b8700) at pthread_create.c:333
```

Test Plan: Jenkins

Reviewers: hector, sergei

Reviewed By: hector, sergei

Subscribers: sergei, bogdan, bharat, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D4044
mbautin pushed a commit that referenced this issue Jul 11, 2019
…ed to the

earlier commit 566d6d2

Original commit message:

ENG-4240: #613: Fix checking of tablet presence during transaction CLEANUP

Summary:
We were failing to check the return code of the function `LookupTablePeerOrRespond` when CLEANUP request is received by tablet service.
This was causing the following FATAL right after restart during software upgrade on a cluster with SecondaryIndex workload.

```#0  yb::tserver::TabletServiceImpl::CheckMemoryPressure<yb::tserver::UpdateTransactionResponsePB> (this=this@entry=0x24c2e00, tablet=tablet@entry=0x0,
    resp=resp@entry=0x14d3d410, context=context@entry=0x7f55b1eb5600) at ../../src/yb/tserver/tablet_service.cc:222
#1  0x00007f55d4c8a881 in yb::tserver::TabletServiceImpl::UpdateTransaction (this=this@entry=0x24c2e00, req=req@entry=0x1057aa90, resp=resp@entry=0x14d3d410, context=...)
    at ../../src/yb/tserver/tablet_service.cc:431
#2  0x00007f55d273f28a in yb::tserver::TabletServerServiceIf::Handle (this=0x24c2e00, call=...) at src/yb/tserver/tserver_service.service.cc:267
#3  0x00007f55cff0a3ea in yb::rpc::ServicePoolImpl::Handle (this=0x27ca540, incoming=...) at ../../src/yb/rpc/service_pool.cc:214```

Changed LookupTablePeerOrRespond to return complete result using return value.

Test Plan: Update xdc-user-identity and check that is does not crash and workload is stable.

Reviewers: robert, hector, mikhail, kannan

Reviewed By: mikhail, kannan

Subscribers: kannan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D5772
mbautin added a commit to mbautin/yugabyte-db that referenced this issue Jul 16, 2019
… memtable

Summary:
There was a crash during one of our performance integration tests that was caused by Frontiers() not being set on a memtable. That could only possibly happen if the memtable is empty, and it is still not clear how an empty memtable could get into the list of immutable memtables. Regardless of that, instead of crashing, we should just flush that memtable and log an error message.

```
#0  operator() (memtable=..., __closure=0x7f2e454b67b0) at ../../../../../src/yb/tablet/tablet_peer.cc:178
yugabyte#1  std::_Function_handler<bool(const rocksdb::MemTable&), yb::tablet::TabletPeer::InitTabletPeer(const std::shared_ptr<yb::tablet::enterprise::Tablet>&, const std::shared_future<std::shared_ptr<yb::client::YBClient> >&, const scoped_refptr<yb::server::Clock>&, const std::shared_ptr<yb::rpc::Messenger>&, const scoped_refptr<yb::log::Log>&, const scoped_refptr<yb::MetricEntity>&, yb::ThreadPool*)::<lambda()>::<lambda(const rocksdb::MemTable&)> >::_M_invoke(const std::_Any_data &, const rocksdb::MemTable &) (__functor=..., __args#0=...)  at /n/jenkins/linuxbrew/linuxbrew_2018-01-09T08_28_02/Cellar/gcc/5.5.0/include/c++/5.5.0/functional:1857
yugabyte#2  0x00007f2f7346a70e in operator() (__args#0=..., this=0x7f2e454b67b0) at /n/jenkins/linuxbrew/linuxbrew_2018-01-09T08_28_02/Cellar/gcc/5.5.0/include/c++/5.5.0/functional:2267
yugabyte#3  rocksdb::MemTableList::PickMemtablesToFlush(rocksdb::autovector<rocksdb::MemTable*, 8ul>*, std::function<bool (rocksdb::MemTable const&)> const&) (this=0x7d02978, ret=ret@entry=0x7f2e454b6370, filter=...)
    at ../../../../../src/yb/rocksdb/db/memtable_list.cc:259
yugabyte#4  0x00007f2f7345517f in rocksdb::FlushJob::Run (this=this@entry=0x7f2e454b6750, file_meta=file_meta@entry=0x7f2e454b68d0) at ../../../../../src/yb/rocksdb/db/flush_job.cc:143
yugabyte#5  0x00007f2f7341b7c3 in rocksdb::DBImpl::FlushMemTableToOutputFile (this=this@entry=0x89d2400, cfd=cfd@entry=0x7d02300, mutable_cf_options=..., made_progress=made_progress@entry=0x7f2e454b709e,
    job_context=job_context@entry=0x7f2e454b70b0, log_buffer=0x7f2e454b7280) at ../../../../../src/yb/rocksdb/db/db_impl.cc:1586
yugabyte#6  0x00007f2f7341c19f in rocksdb::DBImpl::BackgroundFlush (this=this@entry=0x89d2400, made_progress=made_progress@entry=0x7f2e454b709e, job_context=job_context@entry=0x7f2e454b70b0,
    log_buffer=log_buffer@entry=0x7f2e454b7280) at ../../../../../src/yb/rocksdb/db/db_impl.cc:2816
yugabyte#7  0x00007f2f7342539b in rocksdb::DBImpl::BackgroundCallFlush (this=0x89d2400) at ../../../../../src/yb/rocksdb/db/db_impl.cc:2838
yugabyte#8  0x00007f2f735154c3 in rocksdb::ThreadPool::BGThread (this=0x3b0bb20, thread_id=0) at ../../../../../src/yb/rocksdb/util/thread_posix.cc:133
yugabyte#9  0x00007f2f73515558 in rocksdb::BGThreadWrapper (arg=0xd970a20) at ../../../../../src/yb/rocksdb/util/thread_posix.cc:157
yugabyte#10 0x00007f2f6c964694 in start_thread (arg=0x7f2e454b8700) at pthread_create.c:333
```

Test Plan: Jenkins

Reviewers: hector, sergei

Reviewed By: hector, sergei

Subscribers: sergei, bogdan, bharat, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D4044

Note:
This commit provides additional functionality that is logically related to
the earlier commit yugabyte@864e72b
and supersedes the commit yugabyte@2932b0a
mbautin pushed a commit to mbautin/yugabyte-db that referenced this issue Jul 16, 2019
…ction CLEANUP

Summary:
We were failing to check the return code of the function `LookupTablePeerOrRespond` when CLEANUP request is received by tablet service.
This was causing the following FATAL right after restart during software upgrade on a cluster with SecondaryIndex workload.

```#0  yb::tserver::TabletServiceImpl::CheckMemoryPressure<yb::tserver::UpdateTransactionResponsePB> (this=this@entry=0x24c2e00, tablet=tablet@entry=0x0,
    resp=resp@entry=0x14d3d410, context=context@entry=0x7f55b1eb5600) at ../../src/yb/tserver/tablet_service.cc:222
yugabyte#1  0x00007f55d4c8a881 in yb::tserver::TabletServiceImpl::UpdateTransaction (this=this@entry=0x24c2e00, req=req@entry=0x1057aa90, resp=resp@entry=0x14d3d410, context=...)
    at ../../src/yb/tserver/tablet_service.cc:431
yugabyte#2  0x00007f55d273f28a in yb::tserver::TabletServerServiceIf::Handle (this=0x24c2e00, call=...) at src/yb/tserver/tserver_service.service.cc:267
yugabyte#3  0x00007f55cff0a3ea in yb::rpc::ServicePoolImpl::Handle (this=0x27ca540, incoming=...) at ../../src/yb/rpc/service_pool.cc:214```

Changed LookupTablePeerOrRespond to return complete result using return value.

Test Plan: Update xdc-user-identity and check that is does not crash and workload is stable.

Reviewers: robert, hector, mikhail, kannan

Reviewed By: mikhail, kannan

Subscribers: kannan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D5772

Note:
This commit provides additional functionality that is logically related to
the earlier commit yugabyte@566d6d2
and supersedes the commit yugabyte@63bae60
bmatican added a commit that referenced this issue Nov 27, 2019
Summary:
Started digging into some TSAN failures in QLTransactionTest.RemoteBootstrap.
Found and fixed a number of issues:

- [#3002] Data race on `yb::log::Log::active_segment_sequence_number()`
Seems this field is protected by a read lock for reads, but was not protected on writes. Turned it
into an atomic.

- [#3007] Race condition between TabletPeer `Init` and `Shutdown`
> std::__1::shared_ptr<yb::tablet::enterprise::Tablet>::get()
> td::__1::shared_ptr<yb::consensus::RaftConsensus>::get()
> src/yb/tablet/tablet_peer.cc:385:17 in yb::tablet::TabletPeer::StartShutdown()
Seems like `Shutdown` did not take the appropriate locks to access either `tablet_` or `consensus_`.

- [#3008] Race condition in thread pool `Worker` Shutdown path:
> #12 yb::rpc::ThreadPool::Impl::Shutdown() /n/users/bogdan/code/yugabyte-db/build/tsan-clang-dynamic-ninja/../../src/yb/rpc/thread_pool.cc:224 (libyrpc.so+0x20c73f)
> #3 yb::rpc::(anonymous namespace)::Worker::Notify() /n/users/bogdan/code/yugabyte-db/build/tsan-clang-dynamic-ninja/../../src/yb/rpc/thread_pool.cc:75 (libyrpc.so+0x20ad6e)
Essentially, we're destroying the vector of workers, but it's possible we still end up trying to notify
them afterwards. Moved some of the code around and expoxed an explicit `Join`. Logic should stay
basically the same. Also moved to shared_ptr instead of raw pointers.

- [#3009] Race condition in Master async RPC task vs CatalogManager reading the task description
> #0 yb::master::PickLeaderReplica::PickReplica(yb::master::TSDescriptor**) /n/users/bogdan/code/yugabyte-db/build/tsan-clang-dynamic-ninja/../../src/yb/master/async_rpc_tasks.cc:95:12 (libmaster.so+0x2450b8)
> #2 yb::master::CatalogManager::SendAddServerRequest(scoped_refptr<yb::master::TabletInfo> const&, yb::consensus::RaftPeerPB_MemberType, yb::consensus::ConsensusStatePB const&, std::__1::basic_string<char, std::__1::char_traits      <char>, std::__1::allocator<char> > const&) /n/users/bogdan/code/yugabyte-db/build/tsan-clang-dynamic-ninja/../../src/yb/master/catalog_manager.cc:5046:54
Just removed the log line..

There are a couple more issues I am still seeing:
- [#3010] Another `Long wait for safe op id`, but one seems like a bootstrap bug
Currently, doing a remote bootstrap triggers an inline OpenTablet in TSTabletManager, unlike the
normal ones, which are scheduled through a thread pool. That causes issues, because on shutdown,
we wait for the threadpool tasks to finish / get aborted. However, when done inline, this exposes
race conditions between Init and Shutdown paths for TabletPeer, RaftConsensus, Log, etc.

- [#3011] SEGV during `DisableFailureDetector`, during raft shutdown
Caused by the same race between Start (which creates the timer) and shutdown, which aborts it.

- [#3012] Log Close failures not flipping the state to closed
> F20191106 04:56:48 ../../src/yb/consensus/log_util.cc:874] Check failed: !IsFooterWritten()
Caused by the same race above. If TSTabletManager starts a remote bootstrap, we open a log. If we
shutdown the tablet manager, before finishing a bootstrap, we wipe the data, but then when we
close the log, we error out as files do not exist anymore.

- [#3013] Race condition in Master async RPC tasks state transitions
> F20191106 05:10:05 ../../src/yb/master/async_rpc_tasks.cc:126] Check failed: task_state == MonitoredTaskState::kWaiting State: kScheduling
Seems like there was a race between scheduling the task to run on the reactor thread and only AFTER
flipping the state from kScheduling. This can be a standalone investigation.

Test Plan: `ybd tsan --cxx-test client_ql-transaction-test --gtest_filter QLTransactionTest.RemoteBootstrap -n 100 --tp 4`

Reviewers: mikhail, sergei

Reviewed By: sergei

Subscribers: hector, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D7529
mbautin added a commit that referenced this issue Jan 3, 2020
…up on macOS

Summary:
Add a DNS lookup of the local hostname to postmaster startup to force
macOS network libraries to get initialized before any fork() calls
happen.  This fixes failures of ~20 tests in macOS debug mode. Without
this, PostgreSQL backends would frequently crash with SIGSEGV and dump
cores when trying to do the same DNS lookup.

Here is a SIGSEGV stack trace that we would previously get without this
fix:

```
frame #0: 0x00007fff7bd53e34 libsystem_trace.dylib`_os_log_cmp_key + 4
frame #1: 0x00007fff7bbfcb74 libsystem_c.dylib`rb_tree_find_node + 53
frame #2: 0x00007fff7bd52021 libsystem_trace.dylib`os_log_create + 368
frame #3: 0x00007fff7bc5b127 libsystem_info.dylib`gai_log_init + 23
frame #4: 0x00007fff7bd37ce3 libsystem_pthread.dylib`__pthread_once_handler + 65
frame #5: 0x00007fff7bd2daab libsystem_platform.dylib`_os_once_callout + 18
frame #6: 0x00007fff7bd37c7f libsystem_pthread.dylib`pthread_once + 56
frame #7: 0x00007fff7bc5a4ab libsystem_info.dylib`gai_log + 27
frame #8: 0x00007fff7bc5b33f libsystem_info.dylib`_gai_load_libnetwork_once + 63
frame #9: 0x00007fff7bd37ce3 libsystem_pthread.dylib`__pthread_once_handler + 65
frame #10: 0x00007fff7bd2daab libsystem_platform.dylib`_os_once_callout + 18
frame #11: 0x00007fff7bd37c7f libsystem_pthread.dylib`pthread_once + 56
frame #12: 0x00007fff7bc5b29b libsystem_info.dylib`_gai_load_libnetwork + 27
frame #13: 0x00007fff7bc5b64f libsystem_info.dylib`_gai_nat64_v4_address_requires_synthesis + 31
frame #14: 0x00007fff7bc5aaa0 libsystem_info.dylib`_gai_nat64_second_pass + 512
frame #15: 0x00007fff7bc39847 libsystem_info.dylib`si_addrinfo + 1959
frame #16: 0x00007fff7bc38f77 libsystem_info.dylib`_getaddrinfo_internal + 231
frame #17: 0x00007fff7bc38e7d libsystem_info.dylib`getaddrinfo + 61
frame #18: 0x000000011512f8e5 libyb_util.dylib`yb::GetFQDN(hostname="...") at net_util.cc:371:20
```

Test Plan: Jenkins

Reviewers: mihnea, dmitry

Reviewed By: dmitry

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D7757
rajukumaryb added a commit that referenced this issue Feb 7, 2020
…condition between insert and truncate; disable rocksdb flush on truncate

Summary:
- Prevent rocksdb instance from calling `ListenFilesChanged` callback after being detached from `Tablet`

Full stack available at - #3288 (comment)

Follow Up Work: #3476

- Prevent race between `Tablet::AcquireLocksAndPerformDocOperations() -> Tablet::StartDocWriteOperation()` and `Tablet::Truncate()` by incrementing `pending_op_counter_`

Full stack available at -  #3288 (comment)

- Do not flush rocksdb memtable when user truncates table to prevent following crash on flush -

```
#0  0x000056371c706b00 in ?? ()
#1  0x00007f5229291099 in std::__invoke_impl<void, void (yb::tablet::Tablet::*&)(), yb::tablet::Tablet*&> (__t=<optimized out>, __f=<optimized out>) at /usr/include/c++/7/bits/invoke.h:73
#2  std::__invoke<void (yb::tablet::Tablet::*&)(), yb::tablet::Tablet*&> (__fn=<optimized out>) at /usr/include/c++/7/bits/invoke.h:95
#3  std::_Bind<void (yb::tablet::Tablet::*(yb::tablet::Tablet*))()>::__call<void, , 0ul>(std::tuple<>&&, std::_Index_tuple<0ul>) (__args=..., this=<optimized out>) at /usr/include/c++/7/functional:467
#4  std::_Bind<void (yb::tablet::Tablet::*(yb::tablet::Tablet*))()>::operator()<, void>() (this=<optimized out>) at /usr/include/c++/7/functional:551
#5  std::_Function_handler<void (), std::_Bind<void (yb::tablet::Tablet::*(yb::tablet::Tablet*))()> >::_M_invoke(std::_Any_data const&) (__functor=...) at /usr/include/c++/7/bits/std_function.h:316
#6  0x00007f5225aee92c in std::function<void ()>::operator()() const (this=0x56371c5534f0) at /usr/include/c++/7/bits/std_function.h:706
#7  rocksdb::DBImpl::FilesChanged (this=this@entry=0x56371c552b00) at ../../src/yb/rocksdb/db/db_impl.cc:4359
#8  0x00007f5225b14b1b in rocksdb::DBImpl::BackgroundCallFlush (this=this@entry=0x56371c552b00, cfd=cfd@entry=0x0) at ../../src/yb/rocksdb/db/db_impl.cc:3285
```

Full stack available at - #3288 (comment)

- WORKAROUND for race condition (#3288 (comment)) by using `ScopedPendingOperation` in `Tablet::ShouldApplyWrite()` to prevent it from seeing a null `regular_db_` due to a concurrent `Tablet::Truncate()`

Full stack available at - #3288 (comment)

Follow Up Work: #3477

Test Plan:
./yb_build.sh debug --cxx-test pg_libpq-test --gtest_filter PgLibPqTest.ConcurrentInsertTruncateForeignKey
./yb_build.sh debug --cxx-test snapshot-txn-test --gtest_filter SnapshotTxnTest.MultiWriteWithRestart

Reviewers: bogdan, sergei, mikhail

Reviewed By: mikhail

Subscribers: kannan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D7808
spolitov added a commit that referenced this issue Mar 25, 2020
…shutdown

Summary:
In YSQL layer we could run YBCPgGetCatalogMasterVersion and close client immediately.
It results in heap-use-after-free error:

```
[ts-1]     #0 0x7f006c05b5a5 in std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::basic_string(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (/opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20191209181439-7fc63d1583-centos/installed/asan/libcxx/lib/libc++.so.1+0x2065a5)
[ts-1]     #1 0x7f007d8cc059 in yb::HostPort::HostPort(yb::HostPort const&) /nfusr/centos-gcp-cloud/jenkins-slave-150/jenkins/jenkins-github-yugabyte-db-centos-master-clang-asan-758/build/asan-clang-dynamic-ninja/../../src/yb/util/net/net_util.h:49:7
[ts-1]     #2 0x7f007d8f61b7 in yb::client::YBClient::Data::leader_master_hostport() const /nfusr/centos-gcp-cloud/jenkins-slave-150/jenkins/jenkins-github-yugabyte-db-centos-master-clang-asan-758/build/asan-clang-dynamic-ninja/../../src/yb/client/client-internal.cc:1825:10
[ts-1]     #3 0x7f007d9306a5 in yb::Status yb::client::YBClient::Data::SyncLeaderMasterRpc<yb::master::GetYsqlCatalogConfigRequestPB, yb::master::GetYsqlCatalogConfigResponsePB>(std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, yb::client::YBClient*, yb::master::GetYsqlCatalogConfigRequestPB const&, yb::master::GetYsqlCatalogConfigResponsePB*, int*, char const*, std::__1::function<yb::Status (yb::master::MasterServiceProxy*, yb::master::GetYsqlCatalogConfigRequestPB const&, yb::master::GetYsqlCatalogConfigResponsePB*, yb::rpc::RpcController*)> const&) /nfusr/centos-gcp-cloud/jenkins-slave-150/jenkins/jenkins-github-yugabyte-db-centos-master-clang-asan-758/build/asan-clang-dynamic-ninja/../../src/yb/client/client-internal.cc:192:40
[ts-1]     #4 0x7f007d8abc33 in yb::client::YBClient::GetYsqlCatalogMasterVersion(unsigned long*) /nfusr/centos-gcp-cloud/jenkins-slave-150/jenkins/jenkins-github-yugabyte-db-centos-master-clang-asan-758/build/asan-clang-dynamic-ninja/../../src/yb/client/client.cc:654:3
[ts-1]     #5 0x7f00800a14f5 in yb::pggate::PgSession::GetCatalogMasterVersion(unsigned long*) /nfusr/centos-gcp-cloud/jenkins-slave-150/jenkins/jenkins-github-yugabyte-db-centos-master-clang-asan-758/build/asan-clang-dynamic-ninja/../../src/yb/yql/pggate/pg_session.cc:457:19
[ts-1]     #6 0x7f008007be7c in yb::pggate::PgApiImpl::GetCatalogMasterVersion(unsigned long*) /nfusr/centos-gcp-cloud/jenkins-slave-150/jenkins/jenkins-github-yugabyte-db-centos-master-clang-asan-758/build/asan-clang-dynamic-ninja/../../src/yb/yql/pggate/pggate.cc:342:23
[ts-1]     #7 0x7f0080067f27 in YBCPgGetCatalogMasterVersion /nfusr/centos-gcp-cloud/jenkins-slave-150/jenkins/jenkins-github-yugabyte-db-centos-master-clang-asan-758/build/asan-clang-dynamic-ninja/../../src/yb/yql/pggate/ybc_pggate.cc:179:29
```

Fixed by forcing YBClient to wait for all sync operations to complete.

Test Plan: ybd asan --gtest_filter PgLibPqTest.TxnConflictsForColocatedTables

Reviewers: dmitry

Reviewed By: dmitry

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D8178
jasonyb pushed a commit that referenced this issue Jun 11, 2024
jasonyb pushed a commit that referenced this issue Jun 11, 2024
Some formating issues fixed.
jasonyb pushed a commit that referenced this issue Jun 11, 2024
jasonyb pushed a commit that referenced this issue Jun 11, 2024
jasonyb pushed a commit that referenced this issue Jun 11, 2024
jasonyb pushed a commit that referenced this issue Jun 11, 2024
jasonyb pushed a commit that referenced this issue Jun 11, 2024
jasonyb pushed a commit that referenced this issue Jun 11, 2024
jasonyb pushed a commit that referenced this issue Jun 11, 2024
Issue - (#3): README file update.
jasonyb pushed a commit that referenced this issue Jun 11, 2024
Issue - (#16): PG-112: Change the column name "ip" to "client_ip" for readability purpose.
Issue - (#17): PG-111: Show all the queries from complete and incomplete buckets.
Issue - (#18): PG-108: Log the bucket start time.
Issue - (#19): PG-99: Response time histogram.
Issue - (#20): PG-97: Log CPU time for a query.
Issue - (#21): PG-96: Show objects(tables) involved in the query.
Issue - (#22): PG-93: Retain the bucket, and don't delete the bucket automatically.
Issue - (#23): PG-91: Log queries of all the databases.
Issue - (#24): PG-116: Restrict the query size.
Issue - (#3) : README file update.
dr0pdb pushed a commit to dr0pdb/yugabyte-db that referenced this issue Jul 24, 2024
ddhodge added a commit that referenced this issue Jul 30, 2024
…23065)

* initial commit for logical replication docs

* title changes

* changes to view table

* fixed line break

* fixed line break

* added content for delete and update

* added more content

* replaced hyperlink todos with reminders

* added snapshot metrics

* added more content

* added more config properties to docs

* added more config properties to docs

* added more config properties to docs

* replaced postgresql instances with yugabytedb

* added properties

* added complete properties

* changed postgresql to yugabytedb

* added example for all record types

* fixed highlighting of table header

* added type representations

* added type representations

* full content in now;

* full content in now;

* changed postgres references appropriately

* added a missing keyword

* changed name

* self review comments

* self review comments

* added section for logical replication

* added section for logical replication

* modified content for monitor page

* added content for monitoring

* rebased to master;

* CDC logical replication overview (#3)


Co-authored-by: Vaibhav Kushwaha <[email protected]>

* advanced-topic (#5)


Co-authored-by: Vaibhav Kushwaha <[email protected]>

* removed references to incremental and ad-hoc snapshots

* replaced index page with an empty one

* addressed review comments

* added getting started section

* added section for get started

* self review comments

* self review comments

* group review comments

* added hstore and domain type docs

* Advance configurations for CDC using logical replication (#2)

* Fix overview section (#7)

* Monitor section (#4)


Co-authored-by: Vaibhav Kushwaha <[email protected]>

* Initial Snapshot content (#6)

* Add getting started (#1)

* Fix for broken note (#9)

* Fix the issue yaml parsing

Summary:
Fixes the issue yaml parsing. We changed the formatting for yaml list. This diff fixes the
usage for the same.

Test Plan:
Prepared alma9 node using ynp.
Verified universe creation.

Reviewers: vbansal, asharma

Reviewed By: asharma

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D36711

* [PLAT-14534]Add regex match for GCP Instance template

Summary:
Added regex match for gcp instance template.
Regex taken from gcp documentation [[https://cloud.google.com/compute/docs/reference/rest/v1/instanceTemplates | here]].

Test Plan: Tested manually that validation fails with invalid characters.

Reviewers: #yba-api-review!, svarshney

Reviewed By: svarshney

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D36543

* update diagram (#23245)

* [/PLAT-14708] Fix JSON field name in TaskInfo query

Summary: This was missed when task params were moved out from details field.

Test Plan: Trivial - existing tests should succeed.

Reviewers: vbansal, cwang

Reviewed By: vbansal

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D36705

* [#23173] DocDB: Allow large bytes to be passed to RateLimiter

Summary:
RateLimiter has a debug assert that you cannot `Request` more than `GetSingleBurstBytes`. In release mode we do not perform this check and any call gets stuck forever. This change allows large bytes to be requested on RateLimiter. It does so by breaking requests larger than `GetSingleBurstBytes` into multiple smaller requests.

This change is a temporary fix to allow xCluster to operate without any issues. RocksDB RateLimiter has multiple enhancements over the years that would help avoid this and more starvation issues. Ex: facebook/rocksdb@cb2476a. We should consider pulling in those changes.

Fixes #23173
Jira: DB-12112

Test Plan: RateLimiterTest.LargeRequests

Reviewers: slingam

Reviewed By: slingam

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D36703

* [#23179] CDCSDK: Support data types with dynamically alloted oids in CDC

Summary:
This diff adds support for data types with dynamically alloted oids in CDC (for ex: hstore, enum array, etc). Such types contain invalid pg_type_oid for the corresponding columns in docdb schema.

In the current implemtation, in `ybc_pggate`, while decoding the cdc records we look at the `type_map_` to obtain YBCPgTypeEntity, which is then used for decoding. However the `type_map_` does not contain any entries for the data types with dynamically alloted oids. As a result, this causes segmentation fault. To prevent such crashes, CDC prevents addition of tables with such columns to the stream.

This diff removes the filtering logic and adds the tables to the stream even if it has such a type column. A function pointer will now be passed to `YBCPgGetCDCConsistentChanges`, which takes attribute number and the table_oid and returns the appropriate type entity by querying the `pg_type` catalog table. While decoding if a column is encountered with invalid pg_type_oid then, the passed function is invoked and type entity is obtained for decoding.

**Upgrade/Rollback safety:**
This diff adds a field `optional int32 attr_num` to DatumMessagePB. These changes are protected by the autoflag `ysql_yb_enable_replication_slot_consumption` which already exists but has not yet been released.
Jira: DB-12118

Test Plan:
Jenkins: urgent

All the existing cdc tests

./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#replicationConnectionConsumptionAllDataTypesWithYbOutput'

Reviewers: skumar, stiwary, asrinivasan, dmitry

Reviewed By: stiwary, dmitry

Subscribers: steve.varnau, skarri, yql, ybase, ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36689

* [PLAT-14710] Do not return apiToken in response to getSessionInfo

Summary:
**Context**
The GET /session_info YBA API returns:
{
    "authToken": "…",
    "apiToken": "….",
    "apiTokenVersion": "….",
    "customerUUID": "uuid1",
    "userUUID": "useruuid1"
}

The apiToken and apiTokenVersion is supposed to be the last generated token that is valid. We had the following sequence of changes to this API.

https://yugabyte.atlassian.net/browse/PLAT-8028 - Do not store YBA token in YBA.

After the above fix, YBA does not store the apiToken anymore. So it cannot return it as part of the /session_info. The change for this ticket returned the hashed apiToken instead.

https://yugabyte.atlassian.net/browse/PLAT-14672 - getSessionInfo should generate and return api key in response

Since the hashed apiToken value is not useful to any client, and it broke YBM create cluster (https://yugabyte.atlassian.net/browse/CLOUDGA-22117), the first change for this ticket returned a new apiToken instead.

Note that GET /session_info is meant to get customer and user information for the currently authenticated session. This is useful for automation starting off an authenticated session from an existing/cached API token. It is not necessary for the /session_info API to return the authToken and apiToken. The client already has one of authToken or apiToken with which it invoked /session_info API. In fact generating a new apiToken whenever /session_info is called will invalidate the previous apiToken which would not be expected by the client. There is a different API /api_token to regenerate the apiToken explicitly.

**Fix in this change**
So the right behaviour is for /session_info to stop sending the apiToken in the response. In fact, the current behaviour of generating a new apiToken everytime will break a client (for example node-agent usage of /session_info here (https://github.com/yugabyte/yugabyte-db/blob/4ca56cfe27d1cae64e0e61a1bde22406e003ec04/managed/node-agent/app/server/handler.go#L19).

**Client impact of not returning apiToken in response of /session_info**

This should not impact any normal client that was using /session_info only to get the user uuid and customer uuid.

However, there might be a few clients (like YBM for example) that invoked /session_info to get the last generated apiToken from YBA. Unfortunately, this was a mis-use of this API. YBA generates the apiToken in response to a few entry point APIs like /register, /api_login and /api_token. The apiToken is long lived. YBA could choose to expire these apiTokens after a fixed amount of (long) time, but for now there is no expiration. The clients are expected to store the apiToken at their end and use the token to reestablish a session with YBA whenever needed. After establishinig a new session, clients would call GET /session_info to get the user uuid and customer uuid. This is getting fixed in YBM with https://yugabyte.atlassian.net/browse/CLOUDGA-22117. So this PLAT change should be taken up by YBM only after CLOUDGA-22117 is fixed.

Test Plan:
* Manually verified that session_info does not return authToken
* Shubham verified that node-agent works with this fix. Thanks Shubham!

Reviewers: svarshney, dkumar, tbedi, #yba-api-review!

Reviewed By: svarshney

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D36712

* [docs] updates to CVE table status column (#23225)

* updates to status column

* review comment

* format

---------

Co-authored-by: Dwight Hodge <[email protected]>

* [docs] Fix load balance keyword in drivers page (#23253)

[docs] Fix `load_balance` -> `load-balance` in jdbc driver
[docs] Fix `load_balance` -> `loadBalance` in nodejs driver

* fixed compilation

* fix link, format

* format, links

* links, format

* format

* format

* minor edit

* best practice (#8)

* moved sections

* moved pages

* added key concepts page

* added link to getting started

* Dynamic table doc changes (#11)

* icons

* added box for lead link

* revert ybclient change

* revert accidental change

* revert accidental change

* revert accidental change

* fix link block for getting started page

* format

* minor edit

* links, format

* format

* links

* format

* remove reminder references

* Modified output plugin docs (#12)

* Naming edits

* format

* review comments

* diagram

* review comment

* fix links

* format

* format

* link

* review comments

* copy to stable

* link

---------

Co-authored-by: siddharth2411 <[email protected]>
Co-authored-by: Shubham <[email protected]>
Co-authored-by: asharma-yb <[email protected]>
Co-authored-by: Dwight Hodge <[email protected]>
Co-authored-by: Naorem Khogendro Singh <[email protected]>
Co-authored-by: Hari Krishna Sunder <[email protected]>
Co-authored-by: Sumukh-Phalgaonkar <[email protected]>
Co-authored-by: Subramanian Neelakantan <[email protected]>
Co-authored-by: Aishwarya Chakravarthy <[email protected]>
Co-authored-by: Dwight Hodge <[email protected]>
Co-authored-by: ddorian <[email protected]>
Co-authored-by: Sumukh-Phalgaonkar <[email protected]>
myang2021 added a commit that referenced this issue Aug 8, 2024
Summary:
The DDL atomicity stress tests failed more on pg15 branch with an error like:

```
WARNING: ThreadSanitizer: data race (pid=180911)
  Write of size 8 at 0x7b2c000257b8 by thread T17 (mutexes: write M0):
    #0 profile_open_file prof_file.c (libkrb5.so.3+0xf45b3)
    #1 profile_init_flags <null> (libkrb5.so.3+0xfb056)
    #2 k5_os_init_context <null> (libkrb5.so.3+0xe5546)
    #3 krb5_init_context_profile <null> (libkrb5.so.3+0xabc90)
    #4 krb5_init_context <null> (libkrb5.so.3+0xabbd5)
    #5 krb5_gss_init_context init_sec_context.c (libgssapi_krb5.so.2+0x448da)
    #6 acquire_cred_from acquire_cred.c (libgssapi_krb5.so.2+0x39159)
    #7 krb5_gss_acquire_cred_from acquire_cred.c (libgssapi_krb5.so.2+0x39072)
    #8 gss_add_cred_from <null> (libgssapi_krb5.so.2+0x1fcd3)
    #9 gss_acquire_cred_from <null> (libgssapi_krb5.so.2+0x1f69d)
    #10 gss_acquire_cred <null> (libgssapi_krb5.so.2+0x1f431)
    #11 pg_GSS_have_cred_cache ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-gssapi-common.c:68:10 (libpq.so.5+0x543fe)
    #12 PQconnectPoll ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:2909:22 (libpq.so.5+0x359ca)
    #13 connectDBComplete ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:2241:10 (libpq.so.5+0x30807)
    #14 PQconnectdb ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:719:10 (libpq.so.5+0x30af1)
    #15 yb::pgwrapper::PGConn::Connect(string const&, std::chrono::time_point<yb::CoarseMonoClock, std::chrono::duration<long long, std::ratio<1l, 1000000000l>>>, bool, string const&) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.cc:348:24 (libpq_utils.so+0x13c5b)
    #16 yb::pgwrapper::PGConn::Connect(string const&, bool, string const&) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.h:254:12 (libpq_utils.so+0x1a77e)
    #17 yb::pgwrapper::PGConnBuilder::Connect(bool) const ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.cc:743:10 (libpq_utils.so+0x1a77e)
    #18 yb::pgwrapper::LibPqTestBase::ConnectToDBAsUser(string const&, string const&, bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:54:6 (libpg_wrapper_test_base.so+0x26f34)
    #19 yb::pgwrapper::LibPqTestBase::ConnectToDB(string const&, bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:44:10 (libpg_wrapper_test_base.so+0x26b1e)
    #20 yb::pgwrapper::LibPqTestBase::Connect(bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:40:10 (libpg_wrapper_test_base.so+0x26b1e)
    #21 yb::pgwrapper::PgDdlAtomicityStressTest::Connect() ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:147:25 (pg_ddl_atomicity_stress-test+0x136d6c)
    #22 yb::pgwrapper::PgDdlAtomicityStressTest::TestDdl(std::vector<string, std::allocator<string>> const&, int) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:165:15 (pg_ddl_atomicity_stress-test+0x136df5)
    #23 yb::pgwrapper::PgDdlAtomicityStressTest_StressTest_Test::TestBody()::$_2::operator()() const ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:316:5 (pg_ddl_atomicity_stress-test+0x13d2eb)
```

It appears that the function `yb::pgwrapper::LibPqTestBase::Connect` isn't
thread safe. I restructured the code to make the connections in a single thread
and then pass them to various concurrent threads for testing.
Jira: DB-2996

Test Plan:
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/0 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/1 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/2 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/3 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/4 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/5 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/6 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/7 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/8 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/9 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/10 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/11 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/12 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/13 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/14 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/15 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/16 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/17 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/18 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/19 --clang17

Verified that no more tsan errors.

Reviewers: fizaa

Reviewed By: fizaa

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D37111
myang2021 added a commit that referenced this issue Aug 8, 2024
… in tsan build

Summary:
The DDL atomicity stress tests failed more on pg15 branch with an error like:

```
WARNING: ThreadSanitizer: data race (pid=180911)
  Write of size 8 at 0x7b2c000257b8 by thread T17 (mutexes: write M0):
    #0 profile_open_file prof_file.c (libkrb5.so.3+0xf45b3)
    #1 profile_init_flags <null> (libkrb5.so.3+0xfb056)
    #2 k5_os_init_context <null> (libkrb5.so.3+0xe5546)
    #3 krb5_init_context_profile <null> (libkrb5.so.3+0xabc90)
    #4 krb5_init_context <null> (libkrb5.so.3+0xabbd5)
    #5 krb5_gss_init_context init_sec_context.c (libgssapi_krb5.so.2+0x448da)
    #6 acquire_cred_from acquire_cred.c (libgssapi_krb5.so.2+0x39159)
    #7 krb5_gss_acquire_cred_from acquire_cred.c (libgssapi_krb5.so.2+0x39072)
    #8 gss_add_cred_from <null> (libgssapi_krb5.so.2+0x1fcd3)
    #9 gss_acquire_cred_from <null> (libgssapi_krb5.so.2+0x1f69d)
    #10 gss_acquire_cred <null> (libgssapi_krb5.so.2+0x1f431)
    #11 pg_GSS_have_cred_cache ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-gssapi-common.c:68:10 (libpq.so.5+0x543fe)
    #12 PQconnectPoll ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:2909:22 (libpq.so.5+0x359ca)
    #13 connectDBComplete ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:2241:10 (libpq.so.5+0x30807)
    #14 PQconnectdb ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:719:10 (libpq.so.5+0x30af1)
    #15 yb::pgwrapper::PGConn::Connect(string const&, std::chrono::time_point<yb::CoarseMonoClock, std::chrono::duration<long long, std::ratio<1l, 1000000000l>>>, bool, string const&) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.cc:348:24 (libpq_utils.so+0x13c5b)
    #16 yb::pgwrapper::PGConn::Connect(string const&, bool, string const&) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.h:254:12 (libpq_utils.so+0x1a77e)
    #17 yb::pgwrapper::PGConnBuilder::Connect(bool) const ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.cc:743:10 (libpq_utils.so+0x1a77e)
    #18 yb::pgwrapper::LibPqTestBase::ConnectToDBAsUser(string const&, string const&, bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:54:6 (libpg_wrapper_test_base.so+0x26f34)
    #19 yb::pgwrapper::LibPqTestBase::ConnectToDB(string const&, bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:44:10 (libpg_wrapper_test_base.so+0x26b1e)
    #20 yb::pgwrapper::LibPqTestBase::Connect(bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:40:10 (libpg_wrapper_test_base.so+0x26b1e)
    #21 yb::pgwrapper::PgDdlAtomicityStressTest::Connect() ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:147:25 (pg_ddl_atomicity_stress-test+0x136d6c)
    #22 yb::pgwrapper::PgDdlAtomicityStressTest::TestDdl(std::vector<string, std::allocator<string>> const&, int) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:165:15 (pg_ddl_atomicity_stress-test+0x136df5)
    #23 yb::pgwrapper::PgDdlAtomicityStressTest_StressTest_Test::TestBody()::$_2::operator()() const ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:316:5 (pg_ddl_atomicity_stress-test+0x13d2eb)
```

It appears that the function `yb::pgwrapper::LibPqTestBase::Connect` isn't
thread safe. I restructured the code to make the connections in a single thread
and then pass them to various concurrent threads for testing.
Jira: DB-2996

Original commit: bd4874b / D37111

Test Plan:
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/0 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/1 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/2 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/3 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/4 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/5 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/6 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/7 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/8 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/9 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/10 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/11 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/12 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/13 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/14 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/15 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/16 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/17 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/18 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/19 --clang17

Verified that no more tsan errors.

Reviewers: fizaa

Reviewed By: fizaa

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D37167
amitanandaiyer added a commit that referenced this issue Sep 5, 2024
Summary:
Call callback in ScopeExit block only. Not while holding the lock.

Without this fix, it is possible that a thread can get into a deadlock, trying to request a shared_lock on a mutex, while already holding an exclusive lock on the same mutex:

This deadlock can be triggered if there are active read/write requests to a Table (from more than 1 thread)  right after the table had a tablet-split.

 If there is only 1 thread, it is unlikely to run into the deadlock, as the thread notices -- as part of the callback -- that the table's partition info is stale. Having a different thread refresh the partition version before the main thread checks if the table version is stale, is likely necessary to trigger the stack trace seen below.

e.g:
```
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00005640c3eb441b in std::__1::shared_timed_mutex::lock_shared() ()
#2  0x00005640c3ffcbff in yb::client::internal::MetaCache::LookupTabletByKey(std::__1::shared_ptr<yb::client::YBTable> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, std::__1::function<void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>, yb::StronglyTypedBool<yb::client::internal::FailOnPartitionListRefreshed_Tag>) ()
#3  0x00005640c3f7549a in yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*) ()
#4  0x00005640c401855e in yb::client::(anonymous namespace)::FlushBatcherAsync(std::__1::shared_ptr<yb::client::internal::Batcher> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig, yb::StronglyTypedBool<yb::client::internal::IsWithinTransactionRetry_Tag>) ()
#5  0x00005640c401aa76 in yb::client::(anonymous namespace)::BatcherFlushDone(std::__1::shared_ptr<yb::client::internal::Batcher> const&, yb::Status const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig) ()
#6  0x00005640c401b371 in boost::detail::function::void_function_obj_invoker1<std::__1::__bind<void (*)(std::__1::shared_ptr<yb::client::internal::Batcher> const&, yb::Status const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig), std::__1::shared_ptr<yb::client::internal::Batcher> const&, std::__1::placeholders::__ph<1> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig&>, void, yb::Status const&>::invoke(boost::detail::function::function_buffer&, yb::Status const&) ()
#7  0x00005640c3f70398 in yb::client::internal::Batcher::Run() ()
#8  0x00005640c3f72656 in yb::client::internal::Batcher::FlushFinished() ()
#9  0x00005640c3f74a4d in yb::client::internal::Batcher::TabletLookupFinished(yb::client::internal::InFlightOp*, yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> >) ()
#10 0x00005640c3f759bc in std::__1::__function::__func<yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*)::$_0, std::__1::allocator<yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*)::$_0>, void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>::operator()(yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&) ()

#11 0x00005640c3fff05d in yb::client::internal::MetaCache::LookupTabletByKey(std::__1::shared_ptr<yb::client::YBTable> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, std::__1::function<void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>,  yb::StronglyTypedBool<yb::client::internal::FailOnPartitionListRefreshed_Tag>) ()
** Is holding an exclusive lock in MetaCache::LookupTabletByKey/DoLookupTabletByKey **

#12 0x00005640c3f7549a in yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*) ()
#13 0x00005640c401855e in yb::client::(anonymous namespace)::FlushBatcherAsync(std::__1::shared_ptr<yb::client::internal::Batcher> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig, yb::StronglyTypedBool<yb::client::internal::IsWithinTransactionRetry_Tag>) ()
#14 0x00005640c4017130 in yb::client::YBSession::FlushAsync(boost::function<void (yb::client::FlushStatus*)>) ()
#15 0x00005640c5225a0c in yb::tserver::PgClientServiceImpl::Perform(yb::tserver::PgPerformRequestPB const*, yb::tserver::PgPerformResponsePB*, yb::rpc::RpcContext) ()
#16 0x00005640c51c4487 in std::__1::__function::__func<yb::tserver::PgClientServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_20, std::__1::allocator<yb::tserver::PgClientServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_20>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) ()
#17 0x00005640c51d374f in yb::tserver::PgClientServiceIf::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#18 0x00005640c4f5f420 in yb::rpc::ServicePoolImpl::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#19 0x00005640c4e845af in yb::rpc::InboundCall::InboundCallTask::Run() ()
#20 0x00005640c4f6e243 in yb::rpc::(anonymous namespace)::Worker::Execute() ()
#21 0x00005640c570ecb4 in yb::Thread::SuperviseThread(void*) ()
#22 0x00007f808b7c6694 in start_thread (arg=0x7f76d8caf700) at pthread_create.c:333
#23 0x00007f808bac341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
```
Jira: DB-12651

Test Plan:
Jenkins
yb_build.sh --cxx-test ql-stress-test QLStressTest.ReproMetaCacheDeadlock

Reviewers: rthallam, hsunder, qhu, timur

Reviewed By: hsunder

Subscribers: svc_phabricator, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D37706
arpang added a commit that referenced this issue Sep 5, 2024
Summary:
The test CreateInitialSysCatalogSnapshot intermittently fails on asan build with the following stack trace:

   [m-1]     #1 0x5578be7959dc in pg_realloc /home/centos/code/yugabyte-db/src/postgres/src/common/../../../../../src/postgres/src/common/fe_memutils.c:72:8
   [m-1]     #2 0x5578be77b6b1 in readfile /home/centos/code/yugabyte-db/src/postgres/src/bin/initdb/../../../../../../src/postgres/src/bin/initdb/initdb.c:537:23
   [m-1]     #3 0x5578be77837b in bootstrap_template1 /home/centos/code/yugabyte-db/src/postgres/src/bin/initdb/../../../../../../src/postgres/src/bin/initdb/initdb.c:1434:14

This happens because during the execution of `bootstrap_template1()` in initdb, `bki_lines` first points to the memory allocated by readfile(). It then points to the memory allocated by replace_token(), without freeing the memory it was previously pointing to. Fix the issue by freeing up the memory allocated by readfile().

Note that there are more instances of memory leakage in initdb that are not detected by asan runs for some reason. For instance, memory allocated by replace_token() is never freed. These leakages are present in the YB master branch and upstream PG as well. Upstream PG doesn't care about it (https://www.postgresql.org/message-id/28473.1582440206%40sss.pgh.pa.us). The same reasoning applies to YB too. Also to prevent unnecessary deviation from PG code, we can let them remain.

Test Plan:
Jenkins: rebase: pg15
   ./yb_build.sh asan --cxx-test create_initial_sys_catalog_snapshot --gtest_filter CreateInitialSysCatalogSnapshotTest.CreateInitialSysCatalogSnapshot -n 100

Reviewers: jason

Reviewed By: jason

Subscribers: svc_phabricator, yql

Differential Revision: https://phorge.dev.yugabyte.com/D37736
amitanandaiyer added a commit that referenced this issue Sep 6, 2024
…ile holding the lock

Summary:
Original commit: c770d79 / D37706
Call callback in ScopeExit block only. Not while holding the lock.

Without this fix, it is possible that a thread can get into a deadlock, trying to request a shared_lock on a mutex, while already holding an exclusive lock on the same mutex:

This deadlock can be triggered if there are active read/write requests to a Table (from more than 1 thread)  right after the table had a tablet-split.

 If there is only 1 thread, it is unlikely to run into the deadlock, as the thread notices -- as part of the callback -- that the table's partition info is stale. Having a different thread refresh the partition version before the main thread checks if the table version is stale, is likely necessary to trigger the stack trace seen below.

e.g:
```
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00005640c3eb441b in std::__1::shared_timed_mutex::lock_shared() ()
#2  0x00005640c3ffcbff in yb::client::internal::MetaCache::LookupTabletByKey(std::__1::shared_ptr<yb::client::YBTable> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, std::__1::function<void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>, yb::StronglyTypedBool<yb::client::internal::FailOnPartitionListRefreshed_Tag>) ()
#3  0x00005640c3f7549a in yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*) ()
#4  0x00005640c401855e in yb::client::(anonymous namespace)::FlushBatcherAsync(std::__1::shared_ptr<yb::client::internal::Batcher> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig, yb::StronglyTypedBool<yb::client::internal::IsWithinTransactionRetry_Tag>) ()
#5  0x00005640c401aa76 in yb::client::(anonymous namespace)::BatcherFlushDone(std::__1::shared_ptr<yb::client::internal::Batcher> const&, yb::Status const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig) ()
#6  0x00005640c401b371 in boost::detail::function::void_function_obj_invoker1<std::__1::__bind<void (*)(std::__1::shared_ptr<yb::client::internal::Batcher> const&, yb::Status const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig), std::__1::shared_ptr<yb::client::internal::Batcher> const&, std::__1::placeholders::__ph<1> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig&>, void, yb::Status const&>::invoke(boost::detail::function::function_buffer&, yb::Status const&) ()
#7  0x00005640c3f70398 in yb::client::internal::Batcher::Run() ()
#8  0x00005640c3f72656 in yb::client::internal::Batcher::FlushFinished() ()
#9  0x00005640c3f74a4d in yb::client::internal::Batcher::TabletLookupFinished(yb::client::internal::InFlightOp*, yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> >) ()
#10 0x00005640c3f759bc in std::__1::__function::__func<yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*)::$_0, std::__1::allocator<yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*)::$_0>, void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>::operator()(yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&) ()

#11 0x00005640c3fff05d in yb::client::internal::MetaCache::LookupTabletByKey(std::__1::shared_ptr<yb::client::YBTable> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, std::__1::function<void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>,  yb::StronglyTypedBool<yb::client::internal::FailOnPartitionListRefreshed_Tag>) ()
** Is holding an exclusive lock in MetaCache::LookupTabletByKey/DoLookupTabletByKey **

#12 0x00005640c3f7549a in yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*) ()
#13 0x00005640c401855e in yb::client::(anonymous namespace)::FlushBatcherAsync(std::__1::shared_ptr<yb::client::internal::Batcher> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig, yb::StronglyTypedBool<yb::client::internal::IsWithinTransactionRetry_Tag>) ()
#14 0x00005640c4017130 in yb::client::YBSession::FlushAsync(boost::function<void (yb::client::FlushStatus*)>) ()
#15 0x00005640c5225a0c in yb::tserver::PgClientServiceImpl::Perform(yb::tserver::PgPerformRequestPB const*, yb::tserver::PgPerformResponsePB*, yb::rpc::RpcContext) ()
#16 0x00005640c51c4487 in std::__1::__function::__func<yb::tserver::PgClientServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_20, std::__1::allocator<yb::tserver::PgClientServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_20>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) ()
#17 0x00005640c51d374f in yb::tserver::PgClientServiceIf::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#18 0x00005640c4f5f420 in yb::rpc::ServicePoolImpl::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#19 0x00005640c4e845af in yb::rpc::InboundCall::InboundCallTask::Run() ()
#20 0x00005640c4f6e243 in yb::rpc::(anonymous namespace)::Worker::Execute() ()
#21 0x00005640c570ecb4 in yb::Thread::SuperviseThread(void*) ()
#22 0x00007f808b7c6694 in start_thread (arg=0x7f76d8caf700) at pthread_create.c:333
#23 0x00007f808bac341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
```
Jira: DB-12651

Test Plan:
Jenkins
yb_build.sh --cxx-test ql-stress-test QLStressTest.ReproMetaCacheDeadlock

Reviewers: rthallam, hsunder, qhu, timur

Reviewed By: rthallam

Subscribers: ybase, svc_phabricator

Differential Revision: https://phorge.dev.yugabyte.com/D37788
amitanandaiyer added a commit that referenced this issue Sep 6, 2024
…e holding the lock

Summary:
Original commit: c770d79 / D37706
Call callback in ScopeExit block only. Not while holding the lock.

Without this fix, it is possible that a thread can get into a deadlock, trying to request a shared_lock on a mutex, while already holding an exclusive lock on the same mutex:

This deadlock can be triggered if there are active read/write requests to a Table (from more than 1 thread)  right after the table had a tablet-split.

 If there is only 1 thread, it is unlikely to run into the deadlock, as the thread notices -- as part of the callback -- that the table's partition info is stale. Having a different thread refresh the partition version before the main thread checks if the table version is stale, is likely necessary to trigger the stack trace seen below.

e.g:
```
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00005640c3eb441b in std::__1::shared_timed_mutex::lock_shared() ()
#2  0x00005640c3ffcbff in yb::client::internal::MetaCache::LookupTabletByKey(std::__1::shared_ptr<yb::client::YBTable> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, std::__1::function<void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>, yb::StronglyTypedBool<yb::client::internal::FailOnPartitionListRefreshed_Tag>) ()
#3  0x00005640c3f7549a in yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*) ()
#4  0x00005640c401855e in yb::client::(anonymous namespace)::FlushBatcherAsync(std::__1::shared_ptr<yb::client::internal::Batcher> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig, yb::StronglyTypedBool<yb::client::internal::IsWithinTransactionRetry_Tag>) ()
#5  0x00005640c401aa76 in yb::client::(anonymous namespace)::BatcherFlushDone(std::__1::shared_ptr<yb::client::internal::Batcher> const&, yb::Status const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig) ()
#6  0x00005640c401b371 in boost::detail::function::void_function_obj_invoker1<std::__1::__bind<void (*)(std::__1::shared_ptr<yb::client::internal::Batcher> const&, yb::Status const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig), std::__1::shared_ptr<yb::client::internal::Batcher> const&, std::__1::placeholders::__ph<1> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig&>, void, yb::Status const&>::invoke(boost::detail::function::function_buffer&, yb::Status const&) ()
#7  0x00005640c3f70398 in yb::client::internal::Batcher::Run() ()
#8  0x00005640c3f72656 in yb::client::internal::Batcher::FlushFinished() ()
#9  0x00005640c3f74a4d in yb::client::internal::Batcher::TabletLookupFinished(yb::client::internal::InFlightOp*, yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> >) ()
#10 0x00005640c3f759bc in std::__1::__function::__func<yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*)::$_0, std::__1::allocator<yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*)::$_0>, void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>::operator()(yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&) ()

#11 0x00005640c3fff05d in yb::client::internal::MetaCache::LookupTabletByKey(std::__1::shared_ptr<yb::client::YBTable> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, std::__1::function<void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>,  yb::StronglyTypedBool<yb::client::internal::FailOnPartitionListRefreshed_Tag>) ()
** Is holding an exclusive lock in MetaCache::LookupTabletByKey/DoLookupTabletByKey **

#12 0x00005640c3f7549a in yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*) ()
#13 0x00005640c401855e in yb::client::(anonymous namespace)::FlushBatcherAsync(std::__1::shared_ptr<yb::client::internal::Batcher> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig, yb::StronglyTypedBool<yb::client::internal::IsWithinTransactionRetry_Tag>) ()
#14 0x00005640c4017130 in yb::client::YBSession::FlushAsync(boost::function<void (yb::client::FlushStatus*)>) ()
#15 0x00005640c5225a0c in yb::tserver::PgClientServiceImpl::Perform(yb::tserver::PgPerformRequestPB const*, yb::tserver::PgPerformResponsePB*, yb::rpc::RpcContext) ()
#16 0x00005640c51c4487 in std::__1::__function::__func<yb::tserver::PgClientServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_20, std::__1::allocator<yb::tserver::PgClientServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_20>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) ()
#17 0x00005640c51d374f in yb::tserver::PgClientServiceIf::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#18 0x00005640c4f5f420 in yb::rpc::ServicePoolImpl::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#19 0x00005640c4e845af in yb::rpc::InboundCall::InboundCallTask::Run() ()
#20 0x00005640c4f6e243 in yb::rpc::(anonymous namespace)::Worker::Execute() ()
#21 0x00005640c570ecb4 in yb::Thread::SuperviseThread(void*) ()
#22 0x00007f808b7c6694 in start_thread (arg=0x7f76d8caf700) at pthread_create.c:333
#23 0x00007f808bac341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
```
Jira: DB-12651

Test Plan:
Jenkins
yb_build.sh --cxx-test ql-stress-test QLStressTest.ReproMetaCacheDeadlock

Reviewers: rthallam, hsunder, qhu, timur

Reviewed By: rthallam

Subscribers: svc_phabricator, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D37789
amitanandaiyer added a commit that referenced this issue Sep 6, 2024
…ile holding the lock

Summary:
Original commit: c770d79 / D37706
Call callback in ScopeExit block only. Not while holding the lock.

Without this fix, it is possible that a thread can get into a deadlock, trying to request a shared_lock on a mutex, while already holding an exclusive lock on the same mutex:

This deadlock can be triggered if there are active read/write requests to a Table (from more than 1 thread)  right after the table had a tablet-split.

 If there is only 1 thread, it is unlikely to run into the deadlock, as the thread notices -- as part of the callback -- that the table's partition info is stale. Having a different thread refresh the partition version before the main thread checks if the table version is stale, is likely necessary to trigger the stack trace seen below.

e.g:
```
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00005640c3eb441b in std::__1::shared_timed_mutex::lock_shared() ()
#2  0x00005640c3ffcbff in yb::client::internal::MetaCache::LookupTabletByKey(std::__1::shared_ptr<yb::client::YBTable> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, std::__1::function<void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>, yb::StronglyTypedBool<yb::client::internal::FailOnPartitionListRefreshed_Tag>) ()
#3  0x00005640c3f7549a in yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*) ()
#4  0x00005640c401855e in yb::client::(anonymous namespace)::FlushBatcherAsync(std::__1::shared_ptr<yb::client::internal::Batcher> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig, yb::StronglyTypedBool<yb::client::internal::IsWithinTransactionRetry_Tag>) ()
#5  0x00005640c401aa76 in yb::client::(anonymous namespace)::BatcherFlushDone(std::__1::shared_ptr<yb::client::internal::Batcher> const&, yb::Status const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig) ()
#6  0x00005640c401b371 in boost::detail::function::void_function_obj_invoker1<std::__1::__bind<void (*)(std::__1::shared_ptr<yb::client::internal::Batcher> const&, yb::Status const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig), std::__1::shared_ptr<yb::client::internal::Batcher> const&, std::__1::placeholders::__ph<1> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig&>, void, yb::Status const&>::invoke(boost::detail::function::function_buffer&, yb::Status const&) ()
#7  0x00005640c3f70398 in yb::client::internal::Batcher::Run() ()
#8  0x00005640c3f72656 in yb::client::internal::Batcher::FlushFinished() ()
#9  0x00005640c3f74a4d in yb::client::internal::Batcher::TabletLookupFinished(yb::client::internal::InFlightOp*, yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> >) ()
#10 0x00005640c3f759bc in std::__1::__function::__func<yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*)::$_0, std::__1::allocator<yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*)::$_0>, void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>::operator()(yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&) ()

#11 0x00005640c3fff05d in yb::client::internal::MetaCache::LookupTabletByKey(std::__1::shared_ptr<yb::client::YBTable> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, std::__1::function<void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>,  yb::StronglyTypedBool<yb::client::internal::FailOnPartitionListRefreshed_Tag>) ()
** Is holding an exclusive lock in MetaCache::LookupTabletByKey/DoLookupTabletByKey **

#12 0x00005640c3f7549a in yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*) ()
#13 0x00005640c401855e in yb::client::(anonymous namespace)::FlushBatcherAsync(std::__1::shared_ptr<yb::client::internal::Batcher> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig, yb::StronglyTypedBool<yb::client::internal::IsWithinTransactionRetry_Tag>) ()
#14 0x00005640c4017130 in yb::client::YBSession::FlushAsync(boost::function<void (yb::client::FlushStatus*)>) ()
#15 0x00005640c5225a0c in yb::tserver::PgClientServiceImpl::Perform(yb::tserver::PgPerformRequestPB const*, yb::tserver::PgPerformResponsePB*, yb::rpc::RpcContext) ()
#16 0x00005640c51c4487 in std::__1::__function::__func<yb::tserver::PgClientServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_20, std::__1::allocator<yb::tserver::PgClientServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_20>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) ()
#17 0x00005640c51d374f in yb::tserver::PgClientServiceIf::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#18 0x00005640c4f5f420 in yb::rpc::ServicePoolImpl::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#19 0x00005640c4e845af in yb::rpc::InboundCall::InboundCallTask::Run() ()
#20 0x00005640c4f6e243 in yb::rpc::(anonymous namespace)::Worker::Execute() ()
#21 0x00005640c570ecb4 in yb::Thread::SuperviseThread(void*) ()
#22 0x00007f808b7c6694 in start_thread (arg=0x7f76d8caf700) at pthread_create.c:333
#23 0x00007f808bac341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
```
Jira: DB-12651

Test Plan:
Jenkins
yb_build.sh --cxx-test ql-stress-test QLStressTest.ReproMetaCacheDeadlock

Reviewers: rthallam, hsunder, qhu, timur

Reviewed By: rthallam

Subscribers: ybase, svc_phabricator

Differential Revision: https://phorge.dev.yugabyte.com/D37831
es1024 added a commit that referenced this issue Sep 25, 2024
Summary:
It is possible for tablet peer's `tablet_` to be null when a rocksdb flush finishes. We call `tablet_->MaxPersistentOpId()` after flush to clean up recently applied transaction state, and this causes a SIGSEGV:
```
* thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV
  * frame #0: 0x000055885b97311d yb-tserver`yb::ScopedRWOperation::ScopedRWOperation(yb::RWOperationCounter*, yb::StatusHolder const*, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>> const&) [inlined] std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>::basic_string(this="", __str=<unavailable>) at string:898:9
    frame #1: 0x000055885b97311d yb-tserver`yb::ScopedRWOperation::ScopedRWOperation(yb::RWOperationCounter*, yb::StatusHolder const*, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>> const&) [inlined] yb::RWOperationCounter::resource_name(this=0x0000000000000378) const at operation_counter.h:95:12
    frame #2: 0x000055885b97311d yb-tserver`yb::ScopedRWOperation::ScopedRWOperation(this=0x00007f9455305d58, counter=0x0000000000000378, abort_status_holder=<unavailable>, deadline=0x00007f9455305d98) at operation_counter.cc:190:62
    frame #3: 0x000055885b247ea6 yb-tserver`yb::tablet::Tablet::MaxPersistentOpId(bool) const [inlined] yb::ScopedRWOperation::ScopedRWOperation(this=0x00007f9455305d58, counter=<unavailable>, deadline=0x00007f9455305d98) at operation_counter.h:140:9
    frame #4: 0x000055885b247e9f yb-tserver`yb::tablet::Tablet::MaxPersistentOpId(bool) const [inlined] yb::tablet::Tablet::CreateScopedRWOperationBlockingRocksDbShutdownStart(this=0x0000000000000000, deadline=yb::CoarseTimePoint @ 0x00007f9455305d98) const at tablet.cc:3375:10
    frame #5: 0x000055885b247e90 yb-tserver`yb::tablet::Tablet::MaxPersistentOpId(this=0x0000000000000000, invalid_if_no_new_data=<unavailable>) const at tablet.cc:3540:32
    frame #6: 0x000055885b277f5e yb-tserver`yb::tablet::TabletPeer::MaxPersistentOpId(this=<unavailable>) const at tablet_peer.cc:946:23
    frame #7: 0x000055885b278e52 yb-tserver`non-virtual thunk to yb::tablet::TabletPeer::MaxPersistentOpId() const at tablet_peer.cc:0
    frame #8: 0x000055885b2dec44 yb-tserver`yb::tablet::TransactionParticipant::Impl::DoProcessRecentlyAppliedTransactions(this=0x0000153123151500, retryable_requests_flushed_op_id=<unavailable>, persist=<unavailable>) at transaction_participant.cc:2186:22
    frame #9: 0x000055885b2e0a8e yb-tserver`yb::tablet::TransactionParticipant::ProcessRecentlyAppliedTransactions() [inlined] yb::tablet::TransactionParticipant::Impl::ProcessRecentlyAppliedTransactions(this=0x0000153123151500) at transaction_participant.cc:1440:27
    frame #10: 0x000055885b2e0a63 yb-tserver`yb::tablet::TransactionParticipant::ProcessRecentlyAppliedTransactions(this=<unavailable>) at transaction_participant.cc:2629:17
    frame #11: 0x000055885b226093 yb-tserver`yb::tablet::Tablet::RocksDbListener::OnFlushCompleted(this=0x0000153110c2da58, (null)=<unavailable>, (null)=<unavailable>) at tablet.cc:503:34
    frame #12: 0x000055885af0e507 yb-tserver`rocksdb::DBImpl::BackgroundCallFlush(rocksdb::ColumnFamilyData*) at db_impl.cc:2121:19
    frame #13: 0x000055885af0e275 yb-tserver`rocksdb::DBImpl::BackgroundCallFlush(rocksdb::ColumnFamilyData*) [inlined] rocksdb::DBImpl::FlushMemTableToOutputFile(this=0x0000153123150a80, cfd=0x000015317d651600, mutable_cf_options=0x00007f94553077d8, made_progress=<unavailable>, job_context=0x00007f9455306938, log_buffer=0x00007f9455306048) at db_impl.cc:2008:3
    frame #14: 0x000055885af0d859 yb-tserver`rocksdb::DBImpl::BackgroundCallFlush(rocksdb::ColumnFamilyData*) [inlined] rocksdb::DBImpl::BackgroundFlush(this=0x0000153123150a80, made_progress=<unavailable>, job_context=0x00007f9455306938, log_buffer=0x00007f9455306048, cfd=0x000015317d651600) at db_impl.cc:3399:10
    frame #15: 0x000055885af0d21f yb-tserver`rocksdb::DBImpl::BackgroundCallFlush(this=0x0000153123150a80, cfd=<unavailable>) at db_impl.cc:3470:31
    frame #16: 0x000055885b024a53 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() at thread_posix.cc:133:5
    frame #17: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() [inlined] rocksdb::ThreadPool::StartBGThreads(this=<unavailable>)::$_0::operator()() const at thread_posix.cc:172:5
    frame #18: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() [inlined] decltype(__f=<unavailable>)::$_0&>()()) std::__1::__invoke[abi:ue170006]<rocksdb::ThreadPool::StartBGThreads()::$_0&>(rocksdb::ThreadPool::StartBGThreads()::$_0&) at invoke.h:340:25
    frame #19: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() [inlined] void std::__1::__invoke_void_return_wrapper<void, true>::__call[abi:ue170006]<rocksdb::ThreadPool::StartBGThreads(__args=<unavailable>)::$_0&>(rocksdb::ThreadPool::StartBGThreads()::$_0&) at invoke.h:415:5
    frame #20: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() [inlined] std::__1::__function::__alloc_func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator(this=<unavailable>)[abi:ue170006]() at function.h:192:16
    frame #21: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator(this=<unavailable>)() at function.h:363:12
    frame #22: 0x000055885b9c1543 yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator(this=0x000015313de3b380)[abi:ue170006]() const at function.h:517:16
    frame #23: 0x000055885b9c152d yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator(this=0x000015313de3b380)() const at function.h:1168:12
    frame #24: 0x000055885b9c152d yb-tserver`yb::Thread::SuperviseThread(arg=0x000015313de3b320) at thread.cc:866:3
    frame #25: 0x00007f94994d81ca libpthread.so.0`start_thread + 234
    frame #26: 0x00007f9499729e73 libc.so.6`__clone + 67
```

This diff adds a null check and returns `OpId::Min()` (i.e. don't clean anything up) if `tablet_` is null and we cannot call `MaxPersistentOpId`.
Jira: DB-12915

Test Plan: Jenkins

Reviewers: sergei, rthallam

Reviewed By: sergei, rthallam

Subscribers: rthallam, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38323
es1024 added a commit that referenced this issue Sep 26, 2024
…fter flush

Summary:
Original commit: 250a4d5 / D38323
It is possible for tablet peer's `tablet_` to be null when a rocksdb flush finishes. We call `tablet_->MaxPersistentOpId()` after flush to clean up recently applied transaction state, and this causes a SIGSEGV:
```
* thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV
  * frame #0: 0x000055885b97311d yb-tserver`yb::ScopedRWOperation::ScopedRWOperation(yb::RWOperationCounter*, yb::StatusHolder const*, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>> const&) [inlined] std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>::basic_string(this="", __str=<unavailable>) at string:898:9
    frame #1: 0x000055885b97311d yb-tserver`yb::ScopedRWOperation::ScopedRWOperation(yb::RWOperationCounter*, yb::StatusHolder const*, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>> const&) [inlined] yb::RWOperationCounter::resource_name(this=0x0000000000000378) const at operation_counter.h:95:12
    frame #2: 0x000055885b97311d yb-tserver`yb::ScopedRWOperation::ScopedRWOperation(this=0x00007f9455305d58, counter=0x0000000000000378, abort_status_holder=<unavailable>, deadline=0x00007f9455305d98) at operation_counter.cc:190:62
    frame #3: 0x000055885b247ea6 yb-tserver`yb::tablet::Tablet::MaxPersistentOpId(bool) const [inlined] yb::ScopedRWOperation::ScopedRWOperation(this=0x00007f9455305d58, counter=<unavailable>, deadline=0x00007f9455305d98) at operation_counter.h:140:9
    frame #4: 0x000055885b247e9f yb-tserver`yb::tablet::Tablet::MaxPersistentOpId(bool) const [inlined] yb::tablet::Tablet::CreateScopedRWOperationBlockingRocksDbShutdownStart(this=0x0000000000000000, deadline=yb::CoarseTimePoint @ 0x00007f9455305d98) const at tablet.cc:3375:10
    frame #5: 0x000055885b247e90 yb-tserver`yb::tablet::Tablet::MaxPersistentOpId(this=0x0000000000000000, invalid_if_no_new_data=<unavailable>) const at tablet.cc:3540:32
    frame #6: 0x000055885b277f5e yb-tserver`yb::tablet::TabletPeer::MaxPersistentOpId(this=<unavailable>) const at tablet_peer.cc:946:23
    frame #7: 0x000055885b278e52 yb-tserver`non-virtual thunk to yb::tablet::TabletPeer::MaxPersistentOpId() const at tablet_peer.cc:0
    frame #8: 0x000055885b2dec44 yb-tserver`yb::tablet::TransactionParticipant::Impl::DoProcessRecentlyAppliedTransactions(this=0x0000153123151500, retryable_requests_flushed_op_id=<unavailable>, persist=<unavailable>) at transaction_participant.cc:2186:22
    frame #9: 0x000055885b2e0a8e yb-tserver`yb::tablet::TransactionParticipant::ProcessRecentlyAppliedTransactions() [inlined] yb::tablet::TransactionParticipant::Impl::ProcessRecentlyAppliedTransactions(this=0x0000153123151500) at transaction_participant.cc:1440:27
    frame #10: 0x000055885b2e0a63 yb-tserver`yb::tablet::TransactionParticipant::ProcessRecentlyAppliedTransactions(this=<unavailable>) at transaction_participant.cc:2629:17
    frame #11: 0x000055885b226093 yb-tserver`yb::tablet::Tablet::RocksDbListener::OnFlushCompleted(this=0x0000153110c2da58, (null)=<unavailable>, (null)=<unavailable>) at tablet.cc:503:34
    frame #12: 0x000055885af0e507 yb-tserver`rocksdb::DBImpl::BackgroundCallFlush(rocksdb::ColumnFamilyData*) at db_impl.cc:2121:19
    frame #13: 0x000055885af0e275 yb-tserver`rocksdb::DBImpl::BackgroundCallFlush(rocksdb::ColumnFamilyData*) [inlined] rocksdb::DBImpl::FlushMemTableToOutputFile(this=0x0000153123150a80, cfd=0x000015317d651600, mutable_cf_options=0x00007f94553077d8, made_progress=<unavailable>, job_context=0x00007f9455306938, log_buffer=0x00007f9455306048) at db_impl.cc:2008:3
    frame #14: 0x000055885af0d859 yb-tserver`rocksdb::DBImpl::BackgroundCallFlush(rocksdb::ColumnFamilyData*) [inlined] rocksdb::DBImpl::BackgroundFlush(this=0x0000153123150a80, made_progress=<unavailable>, job_context=0x00007f9455306938, log_buffer=0x00007f9455306048, cfd=0x000015317d651600) at db_impl.cc:3399:10
    frame #15: 0x000055885af0d21f yb-tserver`rocksdb::DBImpl::BackgroundCallFlush(this=0x0000153123150a80, cfd=<unavailable>) at db_impl.cc:3470:31
    frame #16: 0x000055885b024a53 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() at thread_posix.cc:133:5
    frame #17: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() [inlined] rocksdb::ThreadPool::StartBGThreads(this=<unavailable>)::$_0::operator()() const at thread_posix.cc:172:5
    frame #18: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() [inlined] decltype(__f=<unavailable>)::$_0&>()()) std::__1::__invoke[abi:ue170006]<rocksdb::ThreadPool::StartBGThreads()::$_0&>(rocksdb::ThreadPool::StartBGThreads()::$_0&) at invoke.h:340:25
    frame #19: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() [inlined] void std::__1::__invoke_void_return_wrapper<void, true>::__call[abi:ue170006]<rocksdb::ThreadPool::StartBGThreads(__args=<unavailable>)::$_0&>(rocksdb::ThreadPool::StartBGThreads()::$_0&) at invoke.h:415:5
    frame #20: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() [inlined] std::__1::__function::__alloc_func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator(this=<unavailable>)[abi:ue170006]() at function.h:192:16
    frame #21: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator(this=<unavailable>)() at function.h:363:12
    frame #22: 0x000055885b9c1543 yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator(this=0x000015313de3b380)[abi:ue170006]() const at function.h:517:16
    frame #23: 0x000055885b9c152d yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator(this=0x000015313de3b380)() const at function.h:1168:12
    frame #24: 0x000055885b9c152d yb-tserver`yb::Thread::SuperviseThread(arg=0x000015313de3b320) at thread.cc:866:3
    frame #25: 0x00007f94994d81ca libpthread.so.0`start_thread + 234
    frame #26: 0x00007f9499729e73 libc.so.6`__clone + 67
```

This diff adds a null check and returns `OpId::Min()` (i.e. don't clean anything up) if `tablet_` is null and we cannot call `MaxPersistentOpId`.
Jira: DB-12915

Test Plan: Jenkins

Reviewers: sergei, rthallam

Reviewed By: rthallam

Subscribers: ybase, rthallam

Differential Revision: https://phorge.dev.yugabyte.com/D38431
pao214 added a commit that referenced this issue Oct 23, 2024
Summary:
### Issue

Test ClockSynchronizationTest.TestClockSkewError fails with tsan failure

```
WARNING: ThreadSanitizer: data race (pid=226462)
  Read of size 8 at 0x7b4000000bf0 by thread T82:
    #0 boost::intrusive_ptr<yb::Status::State>::get() const ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:181:16 (libyb_util.so+0x3c5994)
    #1 bool boost::operator==<yb::Status::State>(boost::intrusive_ptr<yb::Status::State> const&, std::nullptr_t) ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:263:14 (libyb_util.so+0x3c5994)
    #2 yb::Status::ok() const ${YB_SRC_ROOT}/src/yb/util/status.h:120:51 (libyb_util.so+0x3c5994)
    #3 yb::MockClock::Now() ${YB_SRC_ROOT}/src/yb/util/physical_time.cc:141:3 (libyb_util.so+0x3c5994)
    #4 yb::server::HybridClock::NowWithError(yb::HybridTime*, unsigned long*) ${YB_SRC_ROOT}/src/yb/server/hybrid_clock.cc:155:22 (libserver_common.so+0xa5e12)
    #5 yb::server::HybridClock::NowRange() ${YB_SRC_ROOT}/src/yb/server/hybrid_clock.cc:144:3 (libserver_common.so+0xa5ceb)
    #6 yb::ClockBase::Now() ${YB_SRC_ROOT}/src/yb/common/clock.h:26:29 (libtserver.so+0x23a77a)
    #7 yb::tserver::Heartbeater::Thread::TryHeartbeat() ${YB_SRC_ROOT}/src/yb/tserver/heartbeater.cc:437:41 (libtserver.so+0x23a77a)
    #8 yb::tserver::Heartbeater::Thread::DoHeartbeat() ${YB_SRC_ROOT}/src/yb/tserver/heartbeater.cc:650:19 (libtserver.so+0x23d05f)
    #9 yb::tserver::Heartbeater::Thread::RunThread() ${YB_SRC_ROOT}/src/yb/tserver/heartbeater.cc:697:16 (libtserver.so+0x23d74d)
    #10 decltype(*std::declval<yb::tserver::Heartbeater::Thread*&>().*std::declval<void (yb::tserver::Heartbeater::Thread::*&)()>()()) std::__invoke[abi:ue170006]<void (yb::tserver::Heartbeater::Thread::*&)(), yb::tserver::Heartbeater::Thread*&, void>(void (yb::tserver::Heartbeater::Thread::*&)(), yb::tserver::Heartbeater::Thread*&) ${YB_THIRDPARTY_DIR}/installed/tsan/libcxx/include/c++/v1/__type_traits/invoke.h:308:25 (libtserver.so+0x24206b)
...

  Previous write of size 8 at 0x7b4000000bf0 by main thread:
    #0 boost::intrusive_ptr<yb::Status::State>::swap(boost::intrusive_ptr<yb::Status::State>&) ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:210:16 (libyb_util.so+0x3c5c54)
    #1 boost::intrusive_ptr<yb::Status::State>::operator=(boost::intrusive_ptr<yb::Status::State>&&) ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:122:61 (libyb_util.so+0x3c5c54)
    #2 yb::Status::operator=(yb::Status&&) ${YB_SRC_ROOT}/src/yb/util/status.h:98:7 (libyb_util.so+0x3c5c54)
    #3 yb::MockClock::Set(yb::PhysicalTime const&) ${YB_SRC_ROOT}/src/yb/util/physical_time.cc:147:16 (libyb_util.so+0x3c5c54)
    #4 yb::ClockSynchronizationTest_TestClockSkewError_Test::TestBody() ${YB_SRC_ROOT}/src/yb/integration-tests/clock_synchronization-itest.cc:131:15 (clock_synchronization-itest+0x12e3ca)
    #5 void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:2599:10 (libgtest.so.1.12.1+0x894f9)
    #6 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:2635:14 (libgtest.so.1.12.1+0x894f9)
    #7 testing::Test::Run() ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:2674:5 (libgtest.so.1.12.1+0x6123f)
    #8 testing::TestInfo::Run() ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:2853:11 (libgtest.so.1.12.1+0x62a05)
    #9 testing::TestSuite::Run() ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:3012:30 (libgtest.so.1.12.1+0x63f04)
    #10 testing::internal::UnitTestImpl::RunAllTests() ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:5870:44 (libgtest.so.1.12.1+0x7be3d)
...

**SUMMARY**: ThreadSanitizer: data race ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:181:16 in boost::intrusive_ptr<yb::Status::State>::get() const
```

### Fix

Do what value_ does => wrap mock_status_ in boost::atomic.
Jira: DB-13604

Test Plan:
Jenkins

Ran

```
./yb_build.sh tsan --cxx-test integration-tests_clock_synchronization-itest --gtest_filter ClockSynchronizationTest.TestClockSkewError -n 50
```

Reviewers: asrivastava

Reviewed By: asrivastava

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D39315
pao214 added a commit that referenced this issue Oct 24, 2024
Summary:
Original commit: 21d7ad3 / D39315
### Issue

Test ClockSynchronizationTest.TestClockSkewError fails with tsan failure

```
WARNING: ThreadSanitizer: data race (pid=226462)
  Read of size 8 at 0x7b4000000bf0 by thread T82:
    #0 boost::intrusive_ptr<yb::Status::State>::get() const ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:181:16 (libyb_util.so+0x3c5994)
    #1 bool boost::operator==<yb::Status::State>(boost::intrusive_ptr<yb::Status::State> const&, std::nullptr_t) ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:263:14 (libyb_util.so+0x3c5994)
    #2 yb::Status::ok() const ${YB_SRC_ROOT}/src/yb/util/status.h:120:51 (libyb_util.so+0x3c5994)
    #3 yb::MockClock::Now() ${YB_SRC_ROOT}/src/yb/util/physical_time.cc:141:3 (libyb_util.so+0x3c5994)
    #4 yb::server::HybridClock::NowWithError(yb::HybridTime*, unsigned long*) ${YB_SRC_ROOT}/src/yb/server/hybrid_clock.cc:155:22 (libserver_common.so+0xa5e12)
    #5 yb::server::HybridClock::NowRange() ${YB_SRC_ROOT}/src/yb/server/hybrid_clock.cc:144:3 (libserver_common.so+0xa5ceb)
    #6 yb::ClockBase::Now() ${YB_SRC_ROOT}/src/yb/common/clock.h:26:29 (libtserver.so+0x23a77a)
    #7 yb::tserver::Heartbeater::Thread::TryHeartbeat() ${YB_SRC_ROOT}/src/yb/tserver/heartbeater.cc:437:41 (libtserver.so+0x23a77a)
    #8 yb::tserver::Heartbeater::Thread::DoHeartbeat() ${YB_SRC_ROOT}/src/yb/tserver/heartbeater.cc:650:19 (libtserver.so+0x23d05f)
    #9 yb::tserver::Heartbeater::Thread::RunThread() ${YB_SRC_ROOT}/src/yb/tserver/heartbeater.cc:697:16 (libtserver.so+0x23d74d)
    #10 decltype(*std::declval<yb::tserver::Heartbeater::Thread*&>().*std::declval<void (yb::tserver::Heartbeater::Thread::*&)()>()()) std::__invoke[abi:ue170006]<void (yb::tserver::Heartbeater::Thread::*&)(), yb::tserver::Heartbeater::Thread*&, void>(void (yb::tserver::Heartbeater::Thread::*&)(), yb::tserver::Heartbeater::Thread*&) ${YB_THIRDPARTY_DIR}/installed/tsan/libcxx/include/c++/v1/__type_traits/invoke.h:308:25 (libtserver.so+0x24206b)
...

  Previous write of size 8 at 0x7b4000000bf0 by main thread:
    #0 boost::intrusive_ptr<yb::Status::State>::swap(boost::intrusive_ptr<yb::Status::State>&) ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:210:16 (libyb_util.so+0x3c5c54)
    #1 boost::intrusive_ptr<yb::Status::State>::operator=(boost::intrusive_ptr<yb::Status::State>&&) ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:122:61 (libyb_util.so+0x3c5c54)
    #2 yb::Status::operator=(yb::Status&&) ${YB_SRC_ROOT}/src/yb/util/status.h:98:7 (libyb_util.so+0x3c5c54)
    #3 yb::MockClock::Set(yb::PhysicalTime const&) ${YB_SRC_ROOT}/src/yb/util/physical_time.cc:147:16 (libyb_util.so+0x3c5c54)
    #4 yb::ClockSynchronizationTest_TestClockSkewError_Test::TestBody() ${YB_SRC_ROOT}/src/yb/integration-tests/clock_synchronization-itest.cc:131:15 (clock_synchronization-itest+0x12e3ca)
    #5 void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:2599:10 (libgtest.so.1.12.1+0x894f9)
    #6 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:2635:14 (libgtest.so.1.12.1+0x894f9)
    #7 testing::Test::Run() ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:2674:5 (libgtest.so.1.12.1+0x6123f)
    #8 testing::TestInfo::Run() ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:2853:11 (libgtest.so.1.12.1+0x62a05)
    #9 testing::TestSuite::Run() ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:3012:30 (libgtest.so.1.12.1+0x63f04)
    #10 testing::internal::UnitTestImpl::RunAllTests() ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:5870:44 (libgtest.so.1.12.1+0x7be3d)
...

**SUMMARY**: ThreadSanitizer: data race ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:181:16 in boost::intrusive_ptr<yb::Status::State>::get() const
```

### Fix

Use a mutex to prevent data race on mock_status_.

Jira: DB-13604

Test Plan:
Jenkins

Ran

```
./yb_build.sh tsan --cxx-test integration-tests_clock_synchronization-itest --gtest_filter ClockSynchronizationTest.TestClockSkewError -n 50
```

Backport-through: 2024.2

Reviewers: asrivastava

Reviewed By: asrivastava

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D39359
manav-yb pushed a commit that referenced this issue Nov 20, 2024
…tion manager

Summary:
Set pthread_attr_setstacksize to 512 KB in ysql connection manager. This is to fix crashes involving Alma 9 machines.
```
#0  0x000055e430191fd7 in tcmalloc::tcmalloc_internal::PageTracker::Get(tcmalloc::tcmalloc_internal::Length) ()
#1  0x000055e430192774 in tcmalloc::tcmalloc_internal::HugePageFiller<tcmalloc::tcmalloc_internal::PageTracker>::TryGet(tcmalloc::tcmalloc_internal::Length, unsigned long)
    ()
#2  0x000055e430160b3a in tcmalloc::tcmalloc_internal::HugePageAwareAllocator::New(tcmalloc::tcmalloc_internal::Length, unsigned long) ()
#3  0x000055e4301437a2 in void* tcmalloc::tcmalloc_internal::SampleifyAllocation<tcmalloc::tcmalloc_internal::Static, tcmalloc::tcmalloc_internal::TCMallocPolicy<tcmalloc::tcmalloc_internal::MallocOomPolicy, tcmalloc::tcmalloc_internal::MallocAlignPolicy, tcmalloc::tcmalloc_internal::AllocationAccessHotPolicy, tcmalloc::tcmalloc_internal::InvokeHooksPolicy, tcmalloc::tcmalloc_internal::LocalNumaPartitionPolicy> >(tcmalloc::tcmalloc_internal::Static&, tcmalloc::tcmalloc_internal::TCMallocPolicy<tcmalloc::tcmalloc_internal::MallocOomPolicy, tcmalloc::tcmalloc_internal::MallocAlignPolicy, tcmalloc::tcmalloc_internal::AllocationAccessHotPolicy, tcmalloc::tcmalloc_internal::InvokeHooksPolicy, tcmalloc::tcmalloc_internal::LocalNumaPartitionPolicy>, unsigned long, unsigned long, unsigned long, void*, tcmalloc::tcmalloc_internal::Span*, unsigned long*) ()
#4  0x000055e43014339b in void* slow_alloc<tcmalloc::tcmalloc_internal::TCMallocPolicy<tcmalloc::tcmalloc_internal::MallocOomPolicy, tcmalloc::tcmalloc_internal::MallocAlignPolicy, tcmalloc::tcmalloc_internal::AllocationAccessHotPolicy, tcmalloc::tcmalloc_internal::InvokeHooksPolicy, tcmalloc::tcmalloc_internal::LocalNumaPartitionPolicy>, decltype(nullptr)>(tcmalloc::tcmalloc_internal::TCMallocPolicy<tcmalloc::tcmalloc_internal::MallocOomPolicy, tcmalloc::tcmalloc_internal::MallocAlignPolicy, tcmalloc::tcmalloc_internal::AllocationAccessHotPolicy, tcmalloc::tcmalloc_internal::InvokeHooksPolicy, tcmalloc::tcmalloc_internal::LocalNumaPartitionPolicy>, unsigned long, decltype(nullptr))
    ()
#5  0x000055e4301404e6 in malloc ()
#6  0x00007fc3a67b1d7e in ssl3_setup_write_buffer () from /home/centos/code/local_testing/conn_manager_ssl_auth/yugabyte-b124/bin/../lib/yb-thirdparty/libssl.so.3
#7  0x00007fc3a67ae824 in do_ssl3_write () from /home/centos/code/local_testing/conn_manager_ssl_auth/yugabyte-b124/bin/../lib/yb-thirdparty/libssl.so.3
#8  0x00007fc3a67ae3e1 in ssl3_write_bytes () from /home/centos/code/local_testing/conn_manager_ssl_auth/yugabyte-b124/bin/../lib/yb-thirdparty/libssl.so.3
#9  0x00007fc3a67cc251 in ssl3_do_write () from /home/centos/code/local_testing/conn_manager_ssl_auth/yugabyte-b124/bin/../lib/yb-thirdparty/libssl.so.3
#10 0x00007fc3a67c1a6a in state_machine () from /home/centos/code/local_testing/conn_manager_ssl_auth/yugabyte-b124/bin/../lib/yb-thirdparty/libssl.so.3
#11 0x000055e43013d303 in mm_tls_handshake_cb (handle=<optimized out>) at ../../src/odyssey/third_party/machinarium/sources/tls.c:453
#12 0x000055e43013b9e7 in mm_epoll_step (poll=0x37beffd71a60, timeout=<optimized out>) at ../../src/odyssey/third_party/machinarium/sources/epoll.c:79
#13 0x000055e43013b386 in mm_loop_step (loop=0x37beffd72980) at ../../src/odyssey/third_party/machinarium/sources/loop.c:64
#14 machine_main (arg=0x37beffd72780) at ../../src/odyssey/third_party/machinarium/sources/machine.c:56
#15 0x00007fc3a5e89c02 in start_thread (arg=<optimized out>) at pthread_create.c:443
#16 0x00007fc3a5f0ec40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81```
Note that, this 512 KB value is the same as the value used in tserver and master processes via the min_thread_stack_size_bytes GFlag introduced in https://phorge.dev.yugabyte.com/D38053.
Jira: DB-13388

Test Plan: Jenkins: enable connection manager, all tests

Reviewers: skumar, stiwary

Reviewed By: stiwary

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D40087
manav-yb pushed a commit that referenced this issue Nov 20, 2024
…KB in ysql connection manager

Summary:
Original commit: None / D40087
Set pthread_attr_setstacksize to 512 KB in ysql connection manager. This is to fix crashes involving Alma 9 machines.
```
#0  0x000055e430191fd7 in tcmalloc::tcmalloc_internal::PageTracker::Get(tcmalloc::tcmalloc_internal::Length) ()
#1  0x000055e430192774 in tcmalloc::tcmalloc_internal::HugePageFiller<tcmalloc::tcmalloc_internal::PageTracker>::TryGet(tcmalloc::tcmalloc_internal::Length, unsigned long)
    ()
#2  0x000055e430160b3a in tcmalloc::tcmalloc_internal::HugePageAwareAllocator::New(tcmalloc::tcmalloc_internal::Length, unsigned long) ()
#3  0x000055e4301437a2 in void* tcmalloc::tcmalloc_internal::SampleifyAllocation<tcmalloc::tcmalloc_internal::Static, tcmalloc::tcmalloc_internal::TCMallocPolicy<tcmalloc::tcmalloc_internal::MallocOomPolicy, tcmalloc::tcmalloc_internal::MallocAlignPolicy, tcmalloc::tcmalloc_internal::AllocationAccessHotPolicy, tcmalloc::tcmalloc_internal::InvokeHooksPolicy, tcmalloc::tcmalloc_internal::LocalNumaPartitionPolicy> >(tcmalloc::tcmalloc_internal::Static&, tcmalloc::tcmalloc_internal::TCMallocPolicy<tcmalloc::tcmalloc_internal::MallocOomPolicy, tcmalloc::tcmalloc_internal::MallocAlignPolicy, tcmalloc::tcmalloc_internal::AllocationAccessHotPolicy, tcmalloc::tcmalloc_internal::InvokeHooksPolicy, tcmalloc::tcmalloc_internal::LocalNumaPartitionPolicy>, unsigned long, unsigned long, unsigned long, void*, tcmalloc::tcmalloc_internal::Span*, unsigned long*) ()
#4  0x000055e43014339b in void* slow_alloc<tcmalloc::tcmalloc_internal::TCMallocPolicy<tcmalloc::tcmalloc_internal::MallocOomPolicy, tcmalloc::tcmalloc_internal::MallocAlignPolicy, tcmalloc::tcmalloc_internal::AllocationAccessHotPolicy, tcmalloc::tcmalloc_internal::InvokeHooksPolicy, tcmalloc::tcmalloc_internal::LocalNumaPartitionPolicy>, decltype(nullptr)>(tcmalloc::tcmalloc_internal::TCMallocPolicy<tcmalloc::tcmalloc_internal::MallocOomPolicy, tcmalloc::tcmalloc_internal::MallocAlignPolicy, tcmalloc::tcmalloc_internal::AllocationAccessHotPolicy, tcmalloc::tcmalloc_internal::InvokeHooksPolicy, tcmalloc::tcmalloc_internal::LocalNumaPartitionPolicy>, unsigned long, decltype(nullptr))
    ()
#5  0x000055e4301404e6 in malloc ()
#6  0x00007fc3a67b1d7e in ssl3_setup_write_buffer () from /home/centos/code/local_testing/conn_manager_ssl_auth/yugabyte-b124/bin/../lib/yb-thirdparty/libssl.so.3
#7  0x00007fc3a67ae824 in do_ssl3_write () from /home/centos/code/local_testing/conn_manager_ssl_auth/yugabyte-b124/bin/../lib/yb-thirdparty/libssl.so.3
#8  0x00007fc3a67ae3e1 in ssl3_write_bytes () from /home/centos/code/local_testing/conn_manager_ssl_auth/yugabyte-b124/bin/../lib/yb-thirdparty/libssl.so.3
#9  0x00007fc3a67cc251 in ssl3_do_write () from /home/centos/code/local_testing/conn_manager_ssl_auth/yugabyte-b124/bin/../lib/yb-thirdparty/libssl.so.3
#10 0x00007fc3a67c1a6a in state_machine () from /home/centos/code/local_testing/conn_manager_ssl_auth/yugabyte-b124/bin/../lib/yb-thirdparty/libssl.so.3
#11 0x000055e43013d303 in mm_tls_handshake_cb (handle=<optimized out>) at ../../src/odyssey/third_party/machinarium/sources/tls.c:453
#12 0x000055e43013b9e7 in mm_epoll_step (poll=0x37beffd71a60, timeout=<optimized out>) at ../../src/odyssey/third_party/machinarium/sources/epoll.c:79
#13 0x000055e43013b386 in mm_loop_step (loop=0x37beffd72980) at ../../src/odyssey/third_party/machinarium/sources/loop.c:64
#14 machine_main (arg=0x37beffd72780) at ../../src/odyssey/third_party/machinarium/sources/machine.c:56
#15 0x00007fc3a5e89c02 in start_thread (arg=<optimized out>) at pthread_create.c:443
#16 0x00007fc3a5f0ec40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81```
Note that, this 512 KB value is the same as the value used in tserver and master processes via the min_thread_stack_size_bytes GFlag introduced in https://phorge.dev.yugabyte.com/D38053.
Jira: DB-13388

Test Plan: Jenkins: enable connection manager, all tests

Reviewers: skumar, stiwary

Reviewed By: stiwary

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D40088
manav-yb pushed a commit that referenced this issue Nov 20, 2024
…2 KB in ysql connection manager

Summary:
Original commit: None / D40088
Set pthread_attr_setstacksize to 512 KB in ysql connection manager. This is to fix crashes involving Alma 9 machines.
```
#0  0x000055e430191fd7 in tcmalloc::tcmalloc_internal::PageTracker::Get(tcmalloc::tcmalloc_internal::Length) ()
#1  0x000055e430192774 in tcmalloc::tcmalloc_internal::HugePageFiller<tcmalloc::tcmalloc_internal::PageTracker>::TryGet(tcmalloc::tcmalloc_internal::Length, unsigned long)
    ()
#2  0x000055e430160b3a in tcmalloc::tcmalloc_internal::HugePageAwareAllocator::New(tcmalloc::tcmalloc_internal::Length, unsigned long) ()
#3  0x000055e4301437a2 in void* tcmalloc::tcmalloc_internal::SampleifyAllocation<tcmalloc::tcmalloc_internal::Static, tcmalloc::tcmalloc_internal::TCMallocPolicy<tcmalloc::tcmalloc_internal::MallocOomPolicy, tcmalloc::tcmalloc_internal::MallocAlignPolicy, tcmalloc::tcmalloc_internal::AllocationAccessHotPolicy, tcmalloc::tcmalloc_internal::InvokeHooksPolicy, tcmalloc::tcmalloc_internal::LocalNumaPartitionPolicy> >(tcmalloc::tcmalloc_internal::Static&, tcmalloc::tcmalloc_internal::TCMallocPolicy<tcmalloc::tcmalloc_internal::MallocOomPolicy, tcmalloc::tcmalloc_internal::MallocAlignPolicy, tcmalloc::tcmalloc_internal::AllocationAccessHotPolicy, tcmalloc::tcmalloc_internal::InvokeHooksPolicy, tcmalloc::tcmalloc_internal::LocalNumaPartitionPolicy>, unsigned long, unsigned long, unsigned long, void*, tcmalloc::tcmalloc_internal::Span*, unsigned long*) ()
#4  0x000055e43014339b in void* slow_alloc<tcmalloc::tcmalloc_internal::TCMallocPolicy<tcmalloc::tcmalloc_internal::MallocOomPolicy, tcmalloc::tcmalloc_internal::MallocAlignPolicy, tcmalloc::tcmalloc_internal::AllocationAccessHotPolicy, tcmalloc::tcmalloc_internal::InvokeHooksPolicy, tcmalloc::tcmalloc_internal::LocalNumaPartitionPolicy>, decltype(nullptr)>(tcmalloc::tcmalloc_internal::TCMallocPolicy<tcmalloc::tcmalloc_internal::MallocOomPolicy, tcmalloc::tcmalloc_internal::MallocAlignPolicy, tcmalloc::tcmalloc_internal::AllocationAccessHotPolicy, tcmalloc::tcmalloc_internal::InvokeHooksPolicy, tcmalloc::tcmalloc_internal::LocalNumaPartitionPolicy>, unsigned long, decltype(nullptr))
    ()
#5  0x000055e4301404e6 in malloc ()
#6  0x00007fc3a67b1d7e in ssl3_setup_write_buffer () from /home/centos/code/local_testing/conn_manager_ssl_auth/yugabyte-b124/bin/../lib/yb-thirdparty/libssl.so.3
#7  0x00007fc3a67ae824 in do_ssl3_write () from /home/centos/code/local_testing/conn_manager_ssl_auth/yugabyte-b124/bin/../lib/yb-thirdparty/libssl.so.3
#8  0x00007fc3a67ae3e1 in ssl3_write_bytes () from /home/centos/code/local_testing/conn_manager_ssl_auth/yugabyte-b124/bin/../lib/yb-thirdparty/libssl.so.3
#9  0x00007fc3a67cc251 in ssl3_do_write () from /home/centos/code/local_testing/conn_manager_ssl_auth/yugabyte-b124/bin/../lib/yb-thirdparty/libssl.so.3
#10 0x00007fc3a67c1a6a in state_machine () from /home/centos/code/local_testing/conn_manager_ssl_auth/yugabyte-b124/bin/../lib/yb-thirdparty/libssl.so.3
#11 0x000055e43013d303 in mm_tls_handshake_cb (handle=<optimized out>) at ../../src/odyssey/third_party/machinarium/sources/tls.c:453
#12 0x000055e43013b9e7 in mm_epoll_step (poll=0x37beffd71a60, timeout=<optimized out>) at ../../src/odyssey/third_party/machinarium/sources/epoll.c:79
#13 0x000055e43013b386 in mm_loop_step (loop=0x37beffd72980) at ../../src/odyssey/third_party/machinarium/sources/loop.c:64
#14 machine_main (arg=0x37beffd72780) at ../../src/odyssey/third_party/machinarium/sources/machine.c:56
#15 0x00007fc3a5e89c02 in start_thread (arg=<optimized out>) at pthread_create.c:443
#16 0x00007fc3a5f0ec40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81```
Note that, this 512 KB value is the same as the value used in tserver and master processes via the min_thread_stack_size_bytes GFlag introduced in https://phorge.dev.yugabyte.com/D38053.
Jira: DB-13388

Test Plan: Jenkins: enable connection manager, all tests

Reviewers: skumar, stiwary

Reviewed By: stiwary

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D40089
Huqicheng added a commit that referenced this issue Dec 12, 2024
…on::GetDocPaths

Summary:
```
../../src/yb/docdb/conflict_resolution.cc:865:5: runtime error: load of value 4458368, which is not a valid value for type 'IsolationLevel'
[m-1] W1212 08:27:12.423846 99051 master_heartbeat_service.cc:426] Could not get YSQL db catalog versions for heartbeat response:
[m-1] W1212 08:27:12.425787 99052 master_heartbeat_service.cc:426] Could not get YSQL db catalog versions for heartbeat response:
    #0 0x7fd9ae920c0e in yb::docdb::(anonymous namespace)::GetWriteRequestIntents(std::vector<std::unique_ptr<yb::docdb::DocOperation, std::default_delete<yb::docdb::DocOperation>>, std::allocator<std::unique_ptr<yb::docdb::DocOperation, std::default_delete<yb::docdb::DocOperation>>>> const&, yb::dockv::KeyBytes*, yb::StronglyTypedBool<yb::dockv::PartialRangeKeyIntents_Tag>, yb::IsolationLevel) ${YB_SRC_ROOT}/src/yb/docdb/conflict_resolution.cc:865:5
    #1 0x7fd9ae9190ce in yb::docdb::(anonymous namespace)::TransactionConflictResolverContext::GetRequestedIntents(yb::docdb::(anonymous namespace)::ConflictResolver*, yb::dockv::KeyBytes*) ${YB_SRC_ROOT}/src/yb/docdb/conflict_resolution.cc:1130:22
    #2 0x7fd9ae90fcee in yb::docdb::(anonymous namespace)::TransactionConflictResolverContext::ReadConflicts(yb::docdb::(anonymous namespace)::ConflictResolver*) ${YB_SRC_ROOT}/src/yb/docdb/conflict_resolution.cc:1164:22
    #3 0x7fd9ae8fd333 in yb::docdb::(anonymous namespace)::ConflictResolver::Resolve() ${YB_SRC_ROOT}/src/yb/docdb/conflict_resolution.cc:198:26
    #4 0x7fd9ae8fc3b2 in yb::docdb::(anonymous namespace)::WaitOnConflictResolver::TryPreWait() ${YB_SRC_ROOT}/src/yb/docdb/conflict_resolution.cc:697:25
    #5 0x7fd9ae8fc3b2 in yb::docdb::(anonymous namespace)::WaitOnConflictResolver::Run() ${YB_SRC_ROOT}/src/yb/docdb/conflict_resolution.cc:670:7
    #6 0x7fd9ae8fa47d in yb::docdb::ResolveTransactionConflicts(std::vector<std::unique_ptr<yb::docdb::DocOperation, std::default_delete<yb::docdb::DocOperation>>, std::allocator<std::unique_ptr<yb::docdb::DocOperation, std::default_delete<yb::docdb::DocOperation>>>> const&, yb::docdb::ConflictManagementPolicy, yb::docdb::LWKeyValueWriteBatchPB const&, yb::HybridTime, yb::HybridTime, long, unsigned long, long, yb::docdb::DocDB const&, yb::StronglyTypedBool<yb::dockv::PartialRangeKeyIntents_Tag>, yb::TransactionStatusManager*, yb::tablet::TabletMetrics*, yb::docdb::LockBatch*, yb::docdb::WaitQueue*, std::chrono::time_point<yb::CoarseMonoClock, std::chrono::duration<long long, std::ratio<1l, 1000000000l>>>, boost::function<void (yb::Result<yb::HybridTime> const&)>) ${YB_SRC_ROOT}/src/yb/docdb/conflict_resolution.cc:1401:15
    #7 0x7fd9aff22ca9 in yb::tablet::WriteQuery::DoExecute() ${YB_SRC_ROOT}/src/yb/tablet/write_query.cc:801:10
    #8 0x7fd9aff1fda3 in yb::tablet::WriteQuery::Execute(std::unique_ptr<yb::tablet::WriteQuery, std::default_delete<yb::tablet::WriteQuery>>) ${YB_SRC_ROOT}/src/yb/tablet/write_query.cc:618:28
    #9 0x7fd9afb9b708 in yb::tablet::Tablet::AcquireLocksAndPerformDocOperations(std::unique_ptr<yb::tablet::WriteQuery, std::default_delete<yb::tablet::WriteQuery>>) ${YB_SRC_ROOT}/src/yb/tablet/tablet.cc:2147:3
    #10 0x7fd9afd166d7 in yb::tablet::TabletPeer::WriteAsync(std::unique_ptr<yb::tablet::WriteQuery, std::default_delete<yb::tablet::WriteQuery>>) ${YB_SRC_ROOT}/src/yb/tablet/tablet_peer.cc:704:12
    #11 0x7fd9b0fd4d57 in yb::tserver::TabletServiceImpl::PerformWrite(yb::tserver::WriteRequestPB const*, yb::tserver::WriteResponsePB*, yb::rpc::RpcContext*) ${YB_SRC_ROOT}/src/yb/tserver/tablet_service.cc:2325:16
    #12 0x7fd9b0fd711a in yb::tserver::TabletServiceImpl::Write(yb::tserver::WriteRequestPB const*, yb::tserver::WriteResponsePB*, yb::rpc::RpcContext) ${YB_SRC_ROOT}/src/yb/tserver/tablet_service.cc:2345:17
    #13 0x7fd9a91098ec in yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0::operator()(std::shared_ptr<yb::rpc::InboundCall>) const::'lambda'(yb::tserver::WriteRequestPB const*, yb::tserver::WriteResponsePB*, yb::rpc::RpcContext)::operator()(yb::tserver::WriteRequestPB const*, yb::tserver::WriteResponsePB*, yb::rpc::RpcContext) const ${BUILD_ROOT}/src/yb/tserver/tserver_service.service.cc:848:9
    #14 0x7fd9a91098ec in auto yb::rpc::HandleCall<yb::rpc::RpcCallPBParamsImpl<yb::tserver::WriteRequestPB, yb::tserver::WriteResponsePB>, yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0::operator()(std::shared_ptr<yb::rpc::InboundCall>) const::'lambda'(yb::tserver::WriteRequestPB const*, yb::tserver::WriteResponsePB*, yb::rpc::RpcContext)>(std::shared_ptr<yb::rpc::InboundCall>, yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0::operator()(std::shared_ptr<yb::rpc::InboundCall>) const::'lambda'(yb::tserver::WriteRequestPB const*, yb::tserver::WriteResponsePB*, yb::rpc::RpcContext)) ${YB_SRC_ROOT}/src/yb/rpc/local_call.h:126:7
    #15 0x7fd9a91098ec in yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0::operator()(std::shared_ptr<yb::rpc::InboundCall>) const ${BUILD_ROOT}/src/yb/tserver/tserver_service.service.cc:846:7
    #16 0x7fd9a91098ec in decltype(std::declval<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0&>()(std::declval<std::shared_ptr<yb::rpc::InboundCall>>())) std::__invoke[abi:ue170006]<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0&, std::shared_ptr<yb::rpc::InboundCall>>(yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0&, std::shared_ptr<yb::rpc::InboundCall>&&) ${YB_THIRDPARTY_DIR}/installed/asan/libcxx/include/c++/v1/__type_traits/invoke.h:340:25
    #17 0x7fd9a91098ec in void std::__invoke_void_return_wrapper<void, true>::__call[abi:ue170006]<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0&, std::shared_ptr<yb::rpc::InboundCall>>(yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0&, std::shared_ptr<yb::rpc::InboundCall>&&) ${YB_THIRDPARTY_DIR}/installed/asan/libcxx/include/c++/v1/__type_traits/invoke.h:415:5
    #18 0x7fd9a91098ec in std::__function::__alloc_func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0, std::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0>, void (std::shared_ptr<yb::rpc::InboundCall>)>::operator()[abi:ue170006](std::shared_ptr<yb::rpc::InboundCall>&&) ${YB_THIRDPARTY_DIR}/installed/asan/libcxx/include/c++/v1/__functional/function.h:192:16
    #19 0x7fd9a91098ec in std::__function::__func<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0, std::allocator<yb::tserver::TabletServerServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0>, void (std::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::shared_ptr<yb::rpc::InboundCall>&&) ${YB_THIRDPARTY_DIR}/installed/asan/libcxx/include/c++/v1/__functional/function.h:363:12
    #20 0x7fd9a9108b34 in std::__function::__value_func<void (std::shared_ptr<yb::rpc::InboundCall>)>::operator()[abi:ue170006](std::shared_ptr<yb::rpc::InboundCall>&&) const ${YB_THIRDPARTY_DIR}/installed/asan/libcxx/include/c++/v1/__functional/function.h:517:16
    #21 0x7fd9a9108b34 in std::function<void (std::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::shared_ptr<yb::rpc::InboundCall>) const ${YB_THIRDPARTY_DIR}/installed/asan/libcxx/include/c++/v1/__functional/function.h:1168:12
    #22 0x7fd9a9108b34 in yb::tserver::TabletServerServiceIf::Handle(std::shared_ptr<yb::rpc::InboundCall>) ${BUILD_ROOT}/src/yb/tserver/tserver_service.service.cc:831:3
    #23 0x7fd9a5b7a892 in yb::rpc::ServicePoolImpl::Handle(std::shared_ptr<yb::rpc::InboundCall>) ${YB_SRC_ROOT}/src/yb/rpc/service_pool.cc:269:19
    #24 0x7fd9a59fcea5 in yb::rpc::InboundCall::InboundCallTask::Run() ${YB_SRC_ROOT}/src/yb/rpc/inbound_call.cc:317:13
    #25 0x7fd9a5bad15d in yb::rpc::(anonymous namespace)::Worker::Execute() ${YB_SRC_ROOT}/src/yb/rpc/thread_pool.cc:115:15
    #26 0x7fd9a42ad037 in std::__function::__value_func<void ()>::operator()[abi:ue170006]() const ${YB_THIRDPARTY_DIR}/installed/asan/libcxx/include/c++/v1/__functional/function.h:517:16
    #27 0x7fd9a42ad037 in std::function<void ()>::operator()() const ${YB_THIRDPARTY_DIR}/installed/asan/libcxx/include/c++/v1/__functional/function.h:1168:12
    #28 0x7fd9a42ad037 in yb::Thread::SuperviseThread(void*) ${YB_SRC_ROOT}/src/yb/util/thread.cc:895:3
    #29 0x56176114abea in asan_thread_start(void*) ${YB_LLVM_TOOLCHAIN_DIR}/src/llvm-project/compiler-rt/lib/asan/asan_interceptors.cpp:225:31
    #30 0x7fd99f1071c9 in start_thread (/lib64/libpthread.so.0+0x81c9) (BuildId: 1962602ac5dc3011b6d697b38b05ddc244197114)
    #31 0x7fd99eb488d2 in clone (/lib64/libc.so.6+0x398d2) (BuildId: 37e4ac6a7fb96950b0e6bf72d73d94f3296c77eb)

UndefinedBehaviorSanitizer: undefined-behavior ../../src/yb/docdb/conflict_resolution.cc:865:5 in
```

IsolationLevel is left uninitialized at `PgsqlLockOperation::GetDocPaths` so the ASAN builds can get the `not a valid value` faliure.
Jira: DB-14458

Test Plan: advisory_lock-test

Reviewers: bkolagani

Reviewed By: bkolagani

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D40638
pkj415 added a commit that referenced this issue Jan 10, 2025
Summary:
Fix asan failure for tests running the isolation regress
framework: org.yb.pgsql.TestPgRegressWaitQueues,
org.yb.pgsql.TestPgRegressIsolationWithoutWaitQueues and
org.yb.pgsql.TestPgRegressIsolation.

The failure is:
```
+=================================================================
+==31476==ERROR: LeakSanitizer: detected memory leaks
+
+Direct leak of 864 byte(s) in 4 object(s) allocated from:
+    #0 0x55fc1116466e in malloc /opt/yb-build/llvm/yb-llvm-v17.0.6-yb-1-1720414757-9b881774-almalinux8-x86_64-build/src/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:69:3
+    #1 0x7f94aa200490 in PQmakeEmptyPGresult /share/jenkins/workspace/github-yugabyte-db-alma8-master-clang17-asan/yugabyte-db/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-exec.c:164:24
+    #2 0x7f94aa21df9d in getRowDescriptions /share/jenkins/workspace/github-yugabyte-db-alma8-master-clang17-asan/yugabyte-db/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-protocol3.c
+    #3 0x7f94aa21c0bc in pqParseInput3 /share/jenkins/workspace/github-yugabyte-db-alma8-master-clang17-asan/yugabyte-db/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-protocol3.c:324:11
+    #4 0x7f94aa207028 in parseInput /share/jenkins/workspace/github-yugabyte-db-alma8-master-clang17-asan/yugabyte-db/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-exec.c:2014:2
+    #5 0x7f94aa207028 in PQgetResult /share/jenkins/workspace/github-yugabyte-db-alma8-master-clang17-asan/yugabyte-db/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-exec.c:2100:3
+    #6 0x7f94aa208437 in PQexecFinish /share/jenkins/workspace/github-yugabyte-db-alma8-master-clang17-asan/yugabyte-db/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-exec.c:2417:19
+    #7 0x7f94aa208437 in PQexecParams /share/jenkins/workspace/github-yugabyte-db-alma8-master-clang17-asan/yugabyte-db/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-exec.c:2279:9
+    #8 0x55fc111a2881 in main /share/jenkins/workspace/github-yugabyte-db-alma8-master-clang17-asan/yugabyte-db/src/postgres/src/test/isolation/../../../../../../src/postgres/src/test/isolation/isolationtester.c:201:9
+    #9 0x7f94a8caa7e4 in __libc_start_main (/lib64/libc.so.6+0x3a7e4) (BuildId: 37e4ac6a7fb96950b0e6bf72d73d94f3296c77eb)
+
+Objects leaked above:
+0x511000004640 (216 bytes)
+0x511000008ec0 (216 bytes)
+0x5110000133c0 (216 bytes)
+0x5110000179c0 (216 bytes)
```
Jira: DB-13172

Test Plan: Jenkins: test regex: .*RegressIsolation.*|.*RegressWaitQueues.*

Reviewers: patnaik.balivada

Reviewed By: patnaik.balivada

Subscribers: jason, yql

Differential Revision: https://phorge.dev.yugabyte.com/D40987
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants