kv: add telemetry for node liveness #72989
Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @lidorcarmel, and @rytaft)
pkg/server/telemetry/features.go, line 132 at r1 (raw file):
// CounterValue returns the telemetry value. Note that this value can be
// different from MetricValue because telemetry may reset to zero occasionally.
Could you explain more about what leads to these occasional resets? Does it even make sense to export this method if it can't be reliably used without caveats?
Exporting the existing metrics 'HeartbeatFailures' and 'EpochIncrements' as telemetry counters.
These telemetry values can be seen in cockroach demo by decommissioning a node and then querying crdb_internal.feature_usage, and also in the 'Diagnostics Reporting Data' page (/_status/diagnostics/local -> "featureUsage").
Issue cockroachdb#71662
Release note: None
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, and @rytaft)
pkg/server/telemetry/features.go, line 132 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you explain more about what leads to these occasional resets? Does it even make sense to export this method if it can't be reliably used without caveats?
Good point. We should not expose it. I wanted it only for the test, but there is another way to get the telemetry: the same way we get those values to generate a report.
Dropped CounterValue and renamed MetricValue to Count, which fits better with what we have for metrics.
Also added a note in the comment that the "telemetry value may reset to zero when, for example, GetFeatureCounts() is called with ResetCounts to generate a report".
Thanks!
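To make the reset behavior discussed above concrete, here is a minimal Go sketch. It is not the code from this PR: `livenessCounter`, `newLivenessCounter`, and `report` are hypothetical names, and `report(true)` only stands in for a report generated by calling GetFeatureCounts() with ResetCounts. The sketch assumes a counter that increments a telemetry value and a metric value together, where only the telemetry value can be reset.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// livenessCounter is a hypothetical counter that feeds two sinks at once:
// a telemetry value (may be reset when a report is generated) and a
// metric value (never reset).
type livenessCounter struct {
	telemetry *int64
	metric    *int64
}

// newLivenessCounter returns a counter with both values at zero.
func newLivenessCounter() livenessCounter {
	return livenessCounter{telemetry: new(int64), metric: new(int64)}
}

// Inc increments the telemetry value and the metric value together.
func (c livenessCounter) Inc() {
	atomic.AddInt64(c.telemetry, 1)
	atomic.AddInt64(c.metric, 1)
}

// Count returns the metric value, which is safe to rely on because it
// is never reset.
func (c livenessCounter) Count() int64 {
	return atomic.LoadInt64(c.metric)
}

// report is a stand-in for generating a diagnostics report. It returns
// the telemetry value and, if reset is true, zeroes it afterwards.
func (c livenessCounter) report(reset bool) int64 {
	if reset {
		return atomic.SwapInt64(c.telemetry, 0)
	}
	return atomic.LoadInt64(c.telemetry)
}

func main() {
	c := newLivenessCounter()
	c.Inc()
	c.Inc()
	fmt.Println(c.report(true), c.Count()) // prints "2 2"; telemetry is now reset

	c.Inc()
	fmt.Println(c.report(false), c.Count()) // prints "1 3"; the values have diverged
}
```

After the resetting report the two values no longer agree, which is why a test that reads the telemetry value directly would need the caveat, while reading Count does not.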
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @rytaft)
bors r+
Build failed (retrying...)
Should this PR close #71662? If so, typically you would want to write "Fixes #71662" or "Closes #71662" in the PR message so the issue will be automatically closed. (no worries at this point, but just FYI for the future)
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
Good to know, thanks. I thought I'd miss something like that.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
Build succeeded
73149: release-21.2: kv: add telemetry for node liveness r=lidorcarmel a=lidorcarmel

Backport 1/1 commits from #72989.

/cc @cockroachdb/release

---

Exporting the existing metrics 'HeartbeatFailures' and 'EpochIncrements' as telemetry counters.

These telemetry values can be seen in cockroach demo by decommissioning a node and then querying crdb_internal.feature_usage, and also in the 'Diagnostics Reporting Data' page (/_status/diagnostics/local -> "featureUsage").

Fixes #71662

Release note: None

Release justification: needed to measure the effectiveness of admission control in 21.2.

Co-authored-by: Lidor Carmel <[email protected]>
Exporting the existing metrics 'HeartbeatFailures' and 'EpochIncrements'
as telemetry counters.
These telemetry values can be seen in cockroach demo by decommissioning
a node and then querying crdb_internal.feature_usage, and also in the
'Diagnostics Reporting Data' page (/_status/diagnostics/local ->
"featureUsage").
Issue #71662
Release note: None