Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: add telemetry for node liveness #72989

Merged
merged 1 commit into from
Nov 24, 2021

Conversation

lidorcarmel
Copy link
Contributor

@lidorcarmel lidorcarmel commented Nov 19, 2021

Exporting the existing metrics 'HeartbeatFailures' and 'EpochIncrements'
as telementry counters.

These telemetry values can be seen in cockroach demo by decommissioning
a node and then querying crdb_internal.feature_usage, and also in the
'Diagnostics Reporting Data' page (/_status/diagnostics/local ->
"featureUsage").

Issue #71662

Release note: None

@lidorcarmel lidorcarmel requested a review from a team as a code owner November 19, 2021 19:09
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@lidorcarmel lidorcarmel force-pushed the lidor_telemetry_liveness branch 4 times, most recently from ca0ea73 to a86b686 Compare November 20, 2021 00:14
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @lidorcarmel, and @rytaft)


pkg/server/telemetry/features.go, line 132 at r1 (raw file):

// CounterValue returns the telemetry value. Note that this value can be
// different from MetricValue because telemetry may reset to zero occasionally.

Could you explain more about what leads to these occasional resets? Does it even make sense to export this method if it can't be reliably used without caveats?

@lidorcarmel lidorcarmel force-pushed the lidor_telemetry_liveness branch from a86b686 to 969901e Compare November 23, 2021 05:34
Exporting the existing metrics 'HeartbeatFailures' and 'EpochIncrements'
as telementry counters.

These telemetry values can be seen in cockroach demo by decommissioning
a node and then querying crdb_internal.feature_usage, and also in the
'Diagnostics Reporting Data' page (/_status/diagnostics/local ->
"featureUsage").

Issue cockroachdb#71662

Release note: None
@lidorcarmel lidorcarmel force-pushed the lidor_telemetry_liveness branch from 969901e to 0b7ba56 Compare November 23, 2021 05:39
Copy link
Contributor Author

@lidorcarmel lidorcarmel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, and @rytaft)


pkg/server/telemetry/features.go, line 132 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you explain more about what leads to these occasional resets? Does it even make sense to export this method if it can't be reliably used without caveats?

Good point. We should not expose it, I wanted it for the test only but there is another way to get the telemetry - the same way we get those values to generate a report.
Dropped CounterValue and renamed MetricValue - "Count" fits better with what we have for metrics.
Also added a note in the comment that the "telemetry value may reset to zero when, for example, GetFeatureCounts() is called with ResetCounts to generate a report".
Thanks!

Copy link
Contributor

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @rytaft)

@lidorcarmel
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Nov 24, 2021

Build failed (retrying...):

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this PR close #71662? If so, typically you would want to write "Fixes #71662" or "Closes #71662" in the PR message so the issue will be automatically closed. (no worries at this point, but just FYI for the future)

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)

Copy link
Contributor Author

@lidorcarmel lidorcarmel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know, thanks. I thought I'd miss something like that..

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@craig
Copy link
Contributor

craig bot commented Nov 24, 2021

Build succeeded:

@craig craig bot merged commit 47ce2f0 into cockroachdb:master Nov 24, 2021
@lidorcarmel lidorcarmel deleted the lidor_telemetry_liveness branch November 24, 2021 23:08
craig bot pushed a commit that referenced this pull request Nov 29, 2021
73149: release-21.2: kv: add telemetry for node liveness r=lidorcarmel a=lidorcarmel

Backport 1/1 commits from #72989.

/cc @cockroachdb/release

---

Exporting the existing metrics 'HeartbeatFailures' and 'EpochIncrements'
as telementry counters.

These telemetry values can be seen in cockroach demo by decommissioning
a node and then querying crdb_internal.feature_usage, and also in the
'Diagnostics Reporting Data' page (/_status/diagnostics/local ->
"featureUsage").

Fixes #71662

Release note: None

Release justification: needed to measure the effectiveness of admission control in 21.2.

Co-authored-by: Lidor Carmel <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants