Skip to content

Commit

Permalink
[yugabyte#18744] DocDB: Fixed issue with last_rpc_start_time_ reset w…
Browse files Browse the repository at this point in the history
…hich resulted in tsan race in connection

Summary:
This change fixes TSAN race issues caused by change 98586ef. In initial change, we were incorrectly setting the RPC start time to now when processing response, because of which detection logic was not working correctly, hence resulting in tsan race issues.

tsan race details -
- Connection getting reset when we are trying to get DebugString. This is solved by avoiding the call to DebugString() until we figure out that call is actually stuck.
```
[ts-2]   Read of size 8 at 0x7b54000c0198 by thread T21 (mutexes: write M0):
[ts-2]     #0 std::shared_ptr<yb::rpc::Connection>::get[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:843:16 (libyrpc.so+0x11383f)
[ts-2]     #1 std::shared_ptr<yb::rpc::Connection>::operator bool[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:875:16 (libyrpc.so+0x11383f)
[ts-2]     #2 yb::rpc::OutboundCall::DebugString() const ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:613:7 (libyrpc.so+0x11383f)
[ts-2]     yugabyte#3 yb::rpc::RpcController::CallStateDebugString() const ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:163:19 (libyrpc.so+0x156976)
[ts-2]     yugabyte#4 yb::consensus::Peer::SignalRequest(yb::consensus::RequestTriggerMode) ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:188:77 (libconsensus.so+0x94c2e)

[ts-2]   Previous write of size 8 at 0x7b54000c0198 by thread T25:
[ts-2]     #0 std::enable_if<is_move_constructible<yb::rpc::Connection*>::value && is_move_assignable<yb::rpc::Connection*>::value, void>::type std::swap[abi:v160006]<yb::rpc::Connection*>(yb::rpc::Connection*&, yb::rpc::Connection*&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__utility/swap.h:41:7 (libyrpc.so+0x11166a)
[ts-2]     #1 std::shared_ptr<yb::rpc::Connection>::swap[abi:v160006](std::shared_ptr<yb::rpc::Connection>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:805:9 (libyrpc.so+0x11166a)
[ts-2]     #2 std::shared_ptr<yb::rpc::Connection>::reset[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:812:22 (libyrpc.so+0x11166a)
[ts-2]     yugabyte#3 yb::rpc::OutboundCall::InvokeCallback() ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:372:15 (libyrpc.so+0x11166a)
[ts-2]     yugabyte#4 yb::rpc::OutboundCall::SetTimedOut() ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:538:5 (libyrpc.so+0x112c64)
```

- RpcController call_ concurrent access - fix will avoid calling finished() function call.
```
[m-2] WARNING: ThreadSanitizer: data race (pid=27931)
[m-2]   Write of size 8 at 0x7b6000040708 by thread T29 (mutexes: write M0):
[m-2]     #0 std::enable_if<is_move_constructible<yb::rpc::OutboundCall*>::value && is_move_assignable<yb::rpc::OutboundCall*>::value, void>::type std::swap[abi:v160006]<yb::rpc::OutboundCall*>(yb::rpc::OutboundCall*&, yb::rpc::OutboundCall*&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__utility/swap.h:41:7 (libyrpc.so+0x1560a4)
[m-2]     #1 std::shared_ptr<yb::rpc::OutboundCall>::swap[abi:v160006](std::shared_ptr<yb::rpc::OutboundCall>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:805:9 (libyrpc.so+0x1560a4)
[m-2]     #2 std::shared_ptr<yb::rpc::OutboundCall>::reset[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:812:22 (libyrpc.so+0x1560a4)
[m-2]     yugabyte#3 yb::rpc::RpcController::Reset() ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:83:9 (libyrpc.so+0x1560a4)
[m-2]     yugabyte#4 yb::consensus::Peer::ProcessResponse() ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:496:15 (libconsensus.so+0x97ffb)

[m-2]   Previous read of size 8 at 0x7b6000040708 by thread T7:
[m-2]     #0 std::shared_ptr<yb::rpc::OutboundCall>::get[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:843:16 (libyrpc.so+0x1561ea)
[m-2]     #1 std::shared_ptr<yb::rpc::OutboundCall>::operator bool[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:875:16 (libyrpc.so+0x1561ea)
[m-2]     #2 yb::rpc::RpcController::finished() const ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:87:7 (libyrpc.so+0x1561ea)
[m-2]     yugabyte#3 yb::consensus::Peer::SignalRequest(yb::consensus::RequestTriggerMode) ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:183:81 (libconsensus.so+0x94934)
[m-2]     yugabyte#4 yb::consensus::Peer::Init()::$_0::operator()() const ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:156:25 (libconsensus.so+0x9bc67)
```
Jira: DB-7637

Test Plan:
./yb_build.sh tsan -n 10 --cxx-test integration-tests_master_failover-itest --gtest_filter MasterFailoverTestIndexCreation/MasterFailoverTestIndexCreation.TestPauseAfterCreateIndexIssued/0
./yb_build.sh tsan -n 10 --cxx-test integration-tests_raft_consensus-itest --gtest_filter RaftConsensusITest.MultiThreadedInsertWithFailovers
Jenkins

Reviewers: mbautin

Reviewed By: mbautin

Subscribers: mbautin, ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D28161
  • Loading branch information
karan-yb committed Aug 29, 2023
1 parent de44d6f commit 0dc13e6
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/yb/consensus/consensus_peers.cc
Original file line number Diff line number Diff line change
Expand Up @@ -512,7 +512,7 @@ void Peer::ProcessResponse() {
void Peer::ProcessHeartbeatResponse(const Status& status) {
DCHECK(performing_heartbeat_mutex_.is_locked()) << "Got a heartbeat when nothing was pending.";
DCHECK(heartbeat_request_.ops().empty()) << "Got a heartbeat with a non-zero number of ops.";
last_rpc_start_time_.store(CoarseMonoClock::now(), std::memory_order_release);
last_rpc_start_time_.store(CoarseTimePoint::min(), std::memory_order_release);

auto performing_heartbeat_lock = LockPerformingHeartbeat(std::adopt_lock);
auto processing_lock = StartProcessingUnlocked();
Expand Down

0 comments on commit 0dc13e6

Please sign in to comment.