Skip to content

Commit

Permalink
[#23645] docdb: Reorder heartbeat handling logic to fix regression.
Browse files Browse the repository at this point in the history
Summary:
A prior commit (15786f3) unintentionally re-ordered some logic in the heartbeat handling path. Before this commit, there was a sequence:

1. Potentially register heartbeating TServer.
2. Fill response with various metadata for piggy-backing features.
3. Lookup heartbeating Tserver.

The commit merged the logic of registering and looking up a tserver into:
1. Potentially register heartbeating TServer and look it up.
2. Fill response with various metadata for piggy-backing features.

If a TServer needed to register say after a master leader failover, the old code would stop handling the heartbeat request at the lookup stage, after filling in the heartbeat for piggy-backing features. However the new code after commit 15786f3 would stop handling the heartbeat before populating the response. This broke a few system tests.

This diff simply reorders registering / looking up the TServer with populating the response. This is not exactly the same as the original semantics - if registering a TServer yields an error the old code wouldn't populate the response but the new code would - but these are edge cases. It's not entirely clear the old code ever returned an error on the registration path, or that this condition is ever triggered in practice. At any rate preserving these semantics is not worth the complexity.
Jira: DB-12556

Test Plan:
```
./yb_build.sh --cxx-test master_heartbeat-itest --gtest_filter MasterHeartbeatITest.PopulateHeartbeatResponseWhenRegistrationRequired
```

Reviewers: jhe, hsunder

Reviewed By: jhe, hsunder

Subscribers: hsunder, slingam, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D37629
  • Loading branch information
druzac committed Aug 29, 2024
1 parent b14851d commit 8a0d6ff
Show file tree
Hide file tree
Showing 2 changed files with 48 additions and 5 deletions.
45 changes: 44 additions & 1 deletion src/yb/integration-tests/master_heartbeat-itest.cc
Original file line number Diff line number Diff line change
Expand Up @@ -22,12 +22,15 @@
#include "yb/client/table_handle.h"
#include "yb/client/yb_table_name.h"

#include "yb/common/common_types.pb.h"

#include "yb/integration-tests/cluster_itest_util.h"
#include "yb/integration-tests/external_mini_cluster.h"
#include "yb/integration-tests/yb_table_test_base.h"

#include "yb/master/catalog_manager_if.h"
#include "yb/master/catalog_entity_info.h"
#include "yb/master/catalog_manager_if.h"
#include "yb/master/master_backup.proxy.h"
#include "yb/master/master_cluster.proxy.h"
#include "yb/master/master_fwd.h"
#include "yb/master/master_heartbeat.proxy.h"
Expand Down Expand Up @@ -326,6 +329,46 @@ TEST_F(MasterHeartbeatITest, ProcessHeartbeatAfterTSRestart) {
ASSERT_EQ(ts->latest_report_seqno(), 0);
}

TEST_F(MasterHeartbeatITest, PopulateHeartbeatResponseWhenRegistrationRequired) {
master::MasterBackupProxy backup_proxy(
proxy_cache_.get(), mini_cluster_->mini_master()->bound_rpc_addr());
ASSERT_OK(client_->CreateNamespaceIfNotExists("yugabyte", YQL_DATABASE_CQL));

// Create a snapshot schedule. Heartbeat responses should always include information on
// snapshot schedules so long as the call is successful and the response's error object is not
// set.
master::CreateSnapshotScheduleRequestPB req;
master::CreateSnapshotScheduleResponsePB resp;
rpc::RpcController rpc;
auto* namespace_filter =
req.mutable_options()->mutable_filter()->mutable_tables()->add_tables()->mutable_namespace_();
*namespace_filter->mutable_name() = "yugabyte";
namespace_filter->set_database_type(YQL_DATABASE_CQL);
req.mutable_options()->set_interval_sec(60);
req.mutable_options()->set_retention_duration_sec(5 * 60);
ASSERT_OK(backup_proxy.CreateSnapshotSchedule(req, &resp, &rpc));
ASSERT_FALSE(resp.has_error()) << resp.DebugString();

// Fabricate a dummy heartbeat request from a new tserver. The master leader should ask us to
// register.
master::TSHeartbeatRequestPB hb_req;
hb_req.mutable_common()->mutable_ts_instance()->set_permanent_uuid("fake-uuid");
hb_req.mutable_common()->mutable_ts_instance()->set_instance_seqno(0);
auto& catalog_mgr = ASSERT_RESULT(mini_cluster_->GetLeaderMiniMaster())->catalog_manager();
hb_req.set_universe_uuid(ASSERT_RESULT(catalog_mgr.GetClusterConfig()).universe_uuid());
master::TSHeartbeatResponsePB hb_resp;
rpc.Reset();
master::MasterHeartbeatProxy heartbeat_proxy(
proxy_cache_.get(), mini_cluster_->mini_master()->bound_rpc_addr());
// The heartbeat response should ask us to re-register but it should also include metadata that
// piggy-backs on heartbeats such as the list of snapshot schedules.
ASSERT_OK(heartbeat_proxy.TSHeartbeat(hb_req, &hb_resp, &rpc));
ASSERT_FALSE(hb_resp.has_error()) << StatusFromPB(hb_resp.error().status());
ASSERT_TRUE(hb_resp.needs_reregister());
ASSERT_GT(hb_resp.snapshots_info().schedules_size(), 0);
ASSERT_EQ(hb_resp.snapshots_info().schedules(0).id(), resp.snapshot_schedule_id());
}

class MasterHeartbeatITestWithUpgrade : public YBTableTestBase {
public:
void SetUp() override {
Expand Down
8 changes: 4 additions & 4 deletions src/yb/master/master_heartbeat_service.cc
Original file line number Diff line number Diff line change
Expand Up @@ -313,15 +313,15 @@ void MasterHeartbeatServiceImpl::TSHeartbeat(
return;
}

if (!FillHeartbeatResponseOrRespond(*req, resp, &rpc).ok()) {
return;
}
auto desc_result = GetTSDescriptorOrRespond(*req, resp, &rpc);
if (!desc_result.ok()) {
return;
}
TSDescriptorPtr ts_desc = std::move(*desc_result);

if (!FillHeartbeatResponseOrRespond(*req, resp, &rpc).ok()) {
return;
}
resp->set_tablet_report_limit(FLAGS_tablet_report_limit);

// Set the TServer metrics in TS Descriptor.
Expand Down Expand Up @@ -1465,7 +1465,7 @@ Result<TSDescriptorPtr> MasterHeartbeatServiceImpl::RegisterTServerOrRespond(
auto status = std::move(desc_result.status());
if (status.IsAlreadyPresent()) {
// AlreadyPresent indicates a host port collision. This tserver shouldn't be heartbeating
// anymore, but in any case ask it to re-register.
// anymore, but in any case ask it to re-register to preserve existing semantics.
LOG(INFO) << Format(
"Got heartbeat from unknown tablet server { $0 } as $1; Asking this server to "
"re-register. Status from ts lookup: $2",
Expand Down

0 comments on commit 8a0d6ff

Please sign in to comment.