[DocDB] New DB API to query master leader for the last RAFT heartbeat time of the followers #18788

Closed
1 task done
iSignal opened this issue Aug 21, 2023 · 2 comments
Labels
area/docdb YugabyteDB core features kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue

Comments

@iSignal
Contributor

iSignal commented Aug 21, 2023

Jira Link: DB-7670

Description

This is needed as part of the following flow for master replacement:

  1. Platform calls a new DB API that asks the master leader for the last RAFT heartbeat time of each follower.
  2. A lagging master that cannot catch up via the WAL should not be considered to be heartbeating successfully.
  3. The API returns an error if the leader lease is not held, so that no action is taken when no master leader exists.
  4. Platform identifies the one master with the maximum delay, provided that delay exceeds a configurable delta (30 mins).
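Steps 3 and 4 of this flow can be sketched on the Platform side roughly as follows. This is a minimal Python sketch under stated assumptions, not the actual Platform code: `pick_master_to_replace`, `plan_replacement`, `LeaderLeaseNotHeldError`, and `DEFAULT_MAX_DELAY_MS` are hypothetical names, the leader's response is modeled as a plain mapping from master UUID to delay in milliseconds, and the 30-minute delta is used only as a default.

```python
from typing import Callable, Mapping, Optional

# Hypothetical default for the configurable delta (30 minutes, per the issue).
DEFAULT_MAX_DELAY_MS = 30 * 60 * 1000


class LeaderLeaseNotHeldError(Exception):
    """Hypothetical error signalling that the queried master holds no leader lease."""


def pick_master_to_replace(
    follower_delays_ms: Mapping[str, int],
    max_delay_ms: int = DEFAULT_MAX_DELAY_MS,
) -> Optional[str]:
    """Return the follower with the largest delay, but only if it exceeds max_delay_ms."""
    if not follower_delays_ms:
        return None
    worst = max(follower_delays_ms, key=lambda uuid: follower_delays_ms[uuid])
    return worst if follower_delays_ms[worst] > max_delay_ms else None


def plan_replacement(
    fetch_delays: Callable[[], Mapping[str, int]],
    max_delay_ms: int = DEFAULT_MAX_DELAY_MS,
) -> Optional[str]:
    """Query the leader via fetch_delays; take no action if no leader lease is held."""
    try:
        delays = fetch_delays()
    except LeaderLeaseNotHeldError:
        return None  # No master leader exists: do nothing (step 3).
    return pick_master_to_replace(delays, max_delay_ms)


if __name__ == "__main__":
    delays = {"master-1": 1_200, "master-2": 45 * 60 * 1000}
    print(plan_replacement(lambda: delays))  # prints master-2
```

Selecting at most one master per pass mirrors step 4; a real caller would presumably re-query after any replacement rather than act on stale delays.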

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
iSignal added the area/docdb (YugabyteDB core features) and status/awaiting-triage (Issue awaiting triage) labels Aug 21, 2023
yugabyte-ci added the kind/enhancement (This is an enhancement of an existing feature) and priority/medium (Medium priority issue) labels and removed the status/awaiting-triage (Issue awaiting triage) label Aug 21, 2023
@mohatagarvit

mohatagarvit commented Oct 31, 2023

Hi,

Hope you are doing well!

I am Garvit Mohata, an MS CS student at UT Austin. My teammate (Kulin Shah) and I are taking the graduate Distributed Systems course and, as part of the course project, we are exploring contributing to the YugabyteDB open-source repository.

This issue seems closely related to the course content, so we were wondering whether we could contribute a fix for it.

Could you share more specific expectations and details for solving this issue, and some pointers if possible?
Thank you!

Regards,
Garvit

@SrivastavaAnubhav
Contributor

Hi @mohatagarvit, I'm excited to hear you're exploring contributing to Yugabyte. I think this issue and the other one you commented on (#16954) are a bit hard for new contributors to pick up, since they require a lot of system context. I recommend browsing issues tagged "Good first issue"; those are ones we explicitly consider self-contained. If you have any more questions, feel free to DM me on the community Slack channel.

druzac added a commit that referenced this issue Dec 4, 2023
…owers

Summary:
This diff adds a new RPC to `MasterAdmin` that returns, for each master follower, the number of milliseconds since the master leader last successfully processed a consensus update from that follower. The implementation simply plumbs the `last_successful_communication_time` field of the consensus queue up to the RPC. Leaders use this same field to judge a peer's health and to decide whether to evict it (although masters do not evict peer masters).

I intend to do a little more cleanup work on the unit tests, but I wanted to get out a diff for review sooner.
Jira: DB-7670

Test Plan:
```
ybd --cxx-test tablet_health_manager-itest --gtest_filter '*GetFollowerUpdateDelay*'
```

Reviewers: asrivastava, rahuldesirazu

Reviewed By: asrivastava

Subscribers: ybase, bogdan, slingam

Differential Revision: https://phorge.dev.yugabyte.com/D30479
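The leader-side computation described in the commit summary can be sketched roughly as follows. This Python sketch assumes a simplified model and is not the real C++ consensus code: `FollowerPeer`, `follower_update_delays_ms`, and `LeaderLeaseNotHeldError` are hypothetical, and `last_successful_communication_time` is modeled as a monotonic timestamp in seconds. It also folds in the issue's requirement that the API fail when the leader lease is not held.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FollowerPeer:
    # Hypothetical stand-in for a consensus-queue peer entry: monotonic time (in
    # seconds) at which the leader last successfully processed an update with it.
    last_successful_communication_time: float


class LeaderLeaseNotHeldError(Exception):
    """Hypothetical error returned when this master does not hold the leader lease."""


def follower_update_delays_ms(
    peers: Dict[str, FollowerPeer],
    holds_leader_lease: bool,
    now: Optional[float] = None,
) -> Dict[str, int]:
    """Map each follower UUID to milliseconds since its last successful update."""
    if not holds_leader_lease:
        # Refuse to answer so the caller takes no action without a real leader.
        raise LeaderLeaseNotHeldError("master leader lease not held")
    if now is None:
        now = time.monotonic()
    return {
        uuid: int((now - peer.last_successful_communication_time) * 1000)
        for uuid, peer in peers.items()
    }
```

Passing `now` explicitly keeps the function deterministic for tests. Under this sketch's assumption that the timestamp advances only on a successful update, a follower that keeps failing to catch up simply accumulates a larger delay, which is the behavior step 2 of the issue's flow asks for.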
druzac closed this as completed Dec 5, 2023
lingamsandeep added a commit that referenced this issue Feb 23, 2024
… of master followers

Summary:
Original commit: 2dfc818 / D30479
This diff adds a new RPC to `MasterAdmin` that returns, for each master follower, the number of milliseconds since the master leader last successfully processed a consensus update from that follower. The implementation simply plumbs the `last_successful_communication_time` field of the consensus queue up to the RPC. Leaders use this same field to judge a peer's health and to decide whether to evict it (although masters do not evict peer masters).

I intend to do a little more cleanup work on the unit tests, but I wanted to get out a diff for review sooner.
Jira: DB-7670

Test Plan:
```
ybd --cxx-test tablet_health_manager-itest --gtest_filter '*GetFollowerUpdateDelay*'
```

Reviewers: asrivastava, rahuldesirazu

Reviewed By: asrivastava

Subscribers: slingam, bogdan, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D32603

5 participants