[DocDB] Expose master's raft heartbeat delay for each follower as a metric #21178

Closed
1 task done
iSignal opened this issue Feb 23, 2024 · 1 comment
Assignees
Labels
area/docdb YugabyteDB core features kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue

Comments

@iSignal
Contributor

iSignal commented Feb 23, 2024

Jira Link: DB-10113

Description

#18788 added a new RPC to expose the RAFT heartbeat delay for masters. Could we also expose this as a metric so that it can be alerted on?

@lingamsandeep @PrarabdhGarg

Issue Type

kind/enhancement

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@iSignal iSignal added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Feb 23, 2024
@yugabyte-ci yugabyte-ci added kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue labels Feb 23, 2024
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Mar 5, 2024
@yugabyte-ci yugabyte-ci assigned druzac and unassigned lingamsandeep Apr 3, 2024
@lingamsandeep
Contributor

@druzac - assigned this to you since you added the API. The ask here is a metric that exposes the same heartbeat delay.

druzac added a commit that referenced this issue Sep 25, 2024
Summary:
Adds a metric for the maximum master follower heartbeat delay. Ideally, this metric would report the raw follower delays directly so that clients could calculate the maximum themselves. However, there is no metric entity we could use to export a metric from the leader master with the right fields, and adding a new metric entity for just this metric seems too involved given the current use case (alert if the follower delay is too large). Another alternative would be to add a server entity metric to each master and have followers report their own delay. But we are interested in the master leader's perspective here, and that approach could introduce surprising behaviour.
Jira: DB-10113
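
For illustration only, the aggregation described in the summary can be sketched as follows. This is not the code from the commit: the function names, the follower identifiers, the millisecond unit, and the threshold are all hypothetical, and the sketch assumes the leader already holds a map of per-follower delays.

```python
from typing import Mapping


def max_follower_heartbeat_delay_ms(delays_ms: Mapping[str, float]) -> float:
    """Collapse the leader's per-follower delays into a single gauge value."""
    # With no followers there is nothing that can be late, so report 0.
    return max(delays_ms.values(), default=0.0)


def should_alert(delays_ms: Mapping[str, float], threshold_ms: float) -> bool:
    """Return True when the slowest follower exceeds the alert threshold."""
    return max_follower_heartbeat_delay_ms(delays_ms) > threshold_ms


# Hypothetical follower names and delays, as seen from the master leader.
sample = {"master-b": 120.0, "master-c": 4500.0}
print(max_follower_heartbeat_delay_ms(sample))
print(should_alert(sample, threshold_ms=1000.0))
```

Reporting only the maximum loses the identity of the slow follower, but it is enough for the stated use case of alerting when any follower falls too far behind.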

Test Plan:
```
./yb_build.sh --with-tests --cxx-test-filter-re tablet_health_manager-itest --cxx-test tablet_health_manager-itest --gtest_filter 'AreNodesSafeToTakeDownItest.GetFollowerUpdate*'
```

Reviewers: asrivastava, sanketh

Reviewed By: asrivastava

Subscribers: ybase, slingam

Differential Revision: https://phorge.dev.yugabyte.com/D38319
timothy-e pushed a commit that referenced this issue Sep 26, 2024
Summary:
 35b12d2 [PLAT-15404] Average YSQL operations latency alert is using incorrect units (ms vs microsecs)
 Excluded: 008f885 [#23788] YSQL, QueryDiagnostics: Fixing issues in pg_stat_statements when no query executed
 6ca8cc4 [#23810] yugabyted-ui: UI is displaying incorrect disk size when multiple data directories
 dca5923 [PLAT-15034][K8s] Add changes to apply master_join_existing_cluster gflag
 fa9b370 [docs] Update content for getting started page for CDC logical replication (#23916)
 8db0ffb [PLAT-15380] clock drift alert did not reference nodes
 44ae377 [PLAT-15349] Mark universe update as success after update lb config
 Excluded: 9f90819 [#24121] xCluster: Fix xcluster_outbound_replication_group-itest TestGetStreamByTableId
 250a4d5 [#24026] docdb: Fix SIGSEGV from MaxPersistentOpId after flush
 0d1046a [DEVOPS-3238] Move macOS build to macos13 (Ventura)
 87cffc6 [#24137] DocDB: Add gflag_allowlist to yb_release_manifest
 678d277 [#21178] docdb: Add metric for the max master follower heartbeat delay.
 ff97f51 [doc][ybm] Certificate links (#24139)
 Excluded: d26b62d [#21733] YSQL: ParallelAppend and pg_hint_plan
 3ffe5a7 [PLAT-10519]Lack of Client-Side Inactivity Timeout - Part 1
 254e164 [PLAT-15432] remove status,sizeInBytes from manifest.json file

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: tfoucher, fizaa, telgersma

Differential Revision: https://phorge.dev.yugabyte.com/D38454
druzac added a commit that referenced this issue Sep 30, 2024
…er heartbeat delay.

Summary:
Adds a metric for the maximum master follower heartbeat delay. Ideally, this metric would report the raw follower delays directly so that clients could calculate the maximum themselves. However, there is no metric entity we could use to export a metric from the leader master with the right fields, and adding a new metric entity for just this metric seems too involved given the current use case (alert if the follower delay is too large). Another alternative would be to add a server entity metric to each master and have followers report their own delay. But we are interested in the master leader's perspective here, and that approach could introduce surprising behaviour.
Jira: DB-10113

Original commit: 678d277 / D38319

Test Plan:
```
./yb_build.sh --with-tests --cxx-test-filter-re tablet_health_manager-itest --cxx-test tablet_health_manager-itest --gtest_filter 'AreNodesSafeToTakeDownItest.GetFollowerUpdate*'
```

Reviewers: asrivastava, sanketh

Reviewed By: asrivastava

Subscribers: slingam, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38441
@druzac druzac closed this as completed Sep 30, 2024
4 participants