[DocDB] Expose master's raft heartbeat delay for each follower as a metric #21178

iSignal · 2024-02-23T19:17:06Z

Description

#18788 added a new RPC to expose the RAFT heartbeat delay for masters. Could we also expose this as a metric so that it can be alerted on?

@lingamsandeep @PrarabdhGarg

Issue Type

kind/enhancement

Warning: Please confirm that this issue does not contain any sensitive information

I confirm this issue does not contain any sensitive information.

lingamsandeep · 2024-04-03T17:10:59Z

@druzac - assigned this to you since you added the API. The ask here is a metric exposing the same delay.

Summary:
Adds a metric for the maximum master follower heartbeat delay.

Ideally this metric would report the raw follower delays directly and clients could calculate the maximum themselves. However there is no metric entity we could use to export a metric from the leader master with the right fields. Adding a new metric entity for just this metric seems too involved given the current use-case (alert if the follower delay is too large).

Another alternative would be adding a server entity metric to each master, and have followers report their own delay. But we are interested in the perspective from the master leader here and that approach could introduce surprising behaviour.

Jira: DB-10113

Test Plan:
```
./yb_build.sh --with-tests --cxx-test-filter-re tablet_health_manager-itest --cxx-test tablet_health_manager-itest --gtest_filter 'AreNodesSafeToTakeDownItest.GetFollowerUpdate*'
```

Reviewers: asrivastava, sanketh

Reviewed By: asrivastava

Subscribers: ybase, slingam

Differential Revision: https://phorge.dev.yugabyte.com/D38319
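The trade-off described in the summary (the leader collapses per-follower delays into a single maximum, which an alert then compares against a threshold) can be illustrated with a minimal sketch. This is not YugabyteDB code: `FollowerDelay`, `max_follower_delay_ms`, `should_alert`, the millisecond unit, and the choice to report 0 when there are no followers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FollowerDelay:
    """Hypothetical record: time since the leader last heard from one follower."""
    follower_id: str
    delay_ms: int


def max_follower_delay_ms(delays: Iterable[FollowerDelay]) -> int:
    """Collapse per-follower delays into one gauge value.

    Returns 0 when there are no followers (an assumption of this sketch).
    """
    return max((d.delay_ms for d in delays), default=0)


def should_alert(max_delay_ms: int, threshold_ms: int) -> bool:
    """Fire when the slowest follower's delay exceeds the threshold."""
    return max_delay_ms > threshold_ms
```

Because only the maximum is exported, an alert built this way can tell that some follower is lagging but not which one; identifying the follower would still require the per-follower view from the leader.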

Summary:
35b12d2 [PLAT-15404] Average YSQL operations latency alert is using incorrect units (ms vs microsecs)
Excluded: 008f885 [#23788] YSQL, QueryDiagnostics: Fixing issues in pg_stat_statements when no query executed
6ca8cc4 [#23810] yugabyted-ui: UI is displaying incorrect disk size when multiple data directories
dca5923 [PLAT-15034][K8s] Add changes to apply master_join_existing_cluster gflag
fa9b370 [docs] Update content for getting started page for CDC logical replication (#23916)
8db0ffb [PLAT-15380] clock drift alert did not reference nodes
44ae377 [PLAT-15349] Mark universe update as success after update lb config
Excluded: 9f90819 [#24121] xCluster: Fix xcluster_outbound_replication_group-itest TestGetStreamByTableId
250a4d5 [#24026] docdb: Fix SIGSEGV from MaxPersistentOpId after flush
0d1046a [DEVOPS-3238] Move macOS build to macos13 (Ventura)
87cffc6 [#24137] DocDB: Add gflag_allowlist to yb_release_manifest
678d277 [#21178] docdb: Add metric for the max master follower heartbeat delay.
ff97f51 [doc][ybm] Certificate links (#24139)
Excluded: d26b62d [#21733] YSQL: ParallelAppend and pg_hint_plan
3ffe5a7 [PLAT-10519]Lack of Client-Side Inactivity Timeout - Part 1
254e164 [PLAT-15432] remove status,sizeInBytes from manifest.json file

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: tfoucher, fizaa, telgersma

Differential Revision: https://phorge.dev.yugabyte.com/D38454

…er heartbeat delay.

Summary:
Adds a metric for the maximum master follower heartbeat delay.

Ideally this metric would report the raw follower delays directly and clients could calculate the maximum themselves. However there is no metric entity we could use to export a metric from the leader master with the right fields. Adding a new metric entity for just this metric seems too involved given the current use-case (alert if the follower delay is too large).

Another alternative would be adding a server entity metric to each master, and have followers report their own delay. But we are interested in the perspective from the master leader here and that approach could introduce surprising behaviour.

Jira: DB-10113

Original commit: 678d277 / D38319

Test Plan:
```
./yb_build.sh --with-tests --cxx-test-filter-re tablet_health_manager-itest --cxx-test tablet_health_manager-itest --gtest_filter 'AreNodesSafeToTakeDownItest.GetFollowerUpdate*'
```

Reviewers: asrivastava, sanketh

Reviewed By: asrivastava

Subscribers: slingam, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38441

iSignal added the area/docdb and status/awaiting-triage labels Feb 23, 2024

yugabyte-ci added the kind/enhancement and priority/medium labels Feb 23, 2024

iSignal assigned lingamsandeep Feb 23, 2024

yugabyte-ci removed the status/awaiting-triage label Mar 5, 2024

yugabyte-ci assigned druzac and unassigned lingamsandeep Apr 3, 2024

druzac closed this as completed Sep 30, 2024
