kvserver: add metric and log when raft.Storage returns an error #113245

Merged: 1 commit into cockroachdb:master, Nov 6, 2023

sumeerbhola

The raft.storage.error metric is incremented whenever raft.Storage returns an error, and the error is logged at most once every 30s (across all replicas).

This was motivated by a test cluster that slowed to a crawl because of deliberate data loss, but was hard to diagnose. The metric could be used for alerting, since we don't expect to see transient errors.

Informs #113053

Epic: none

Release note: None
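
The counting and rate-limited logging described above can be sketched as follows. This is a minimal illustration, not the code from this PR: `storageErrorTracker`, `record`, and `simulate` are hypothetical names, the counter is a plain integer standing in for the raft.storage.error metric, and timestamps are passed in explicitly so the example is deterministic. A single tracker is shared by all replicas, which is what makes the 30s limit apply across replicas rather than per replica.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// storageErrorTracker is a hypothetical stand-in for the metric plus the
// rate-limited logger. One instance is shared by every replica.
type storageErrorTracker struct {
	mu       sync.Mutex
	count    int64         // stand-in for the raft.storage.error counter
	logged   int           // number of log lines actually emitted
	lastLog  time.Time     // zero until the first error is logged
	interval time.Duration // minimum spacing between log lines
}

// record counts every non-nil error and logs it only if at least
// interval has elapsed since the previous log line.
func (t *storageErrorTracker) record(now time.Time, replicaID int, err error) {
	if err == nil {
		return
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	t.count++ // the counter is never rate limited
	if !t.lastLog.IsZero() && now.Sub(t.lastLog) < t.interval {
		return // logged too recently; count only
	}
	t.lastLog = now
	t.logged++
	fmt.Printf("replica %d: raft storage error: %v\n", replicaID, err)
}

// simulate feeds a fixed sequence of results from three replicas through
// one shared tracker and returns the error count and the log-line count.
func simulate() (count int64, logged int) {
	t := &storageErrorTracker{interval: 30 * time.Second}
	base := time.Date(2023, 10, 27, 0, 0, 0, 0, time.UTC)
	errLost := errors.New("entry not found") // illustrative error only
	events := []struct {
		offset    time.Duration
		replicaID int
		err       error
	}{
		{0, 1, errLost},
		{5 * time.Second, 2, nil}, // success: neither counted nor logged
		{10 * time.Second, 2, errLost},
		{29 * time.Second, 3, errLost},
		{30 * time.Second, 1, errLost},
		{45 * time.Second, 2, errLost},
		{61 * time.Second, 3, errLost},
	}
	for _, e := range events {
		t.record(base.Add(e.offset), e.replicaID, e.err)
	}
	return t.count, t.logged
}

func main() {
	count, logged := simulate()
	fmt.Printf("errors=%d logged=%d\n", count, logged)
}
```

Keeping the counter unconditional while throttling only the log line means an alert on the metric sees every error, and the log stays readable even if many replicas fail at once.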

@sumeerbhola sumeerbhola requested review from erikgrinaker and a team October 27, 2023 20:18
@sumeerbhola sumeerbhola requested a review from a team as a code owner October 27, 2023 20:18
@blathers-crl
blathers-crl bot commented Oct 27, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@erikgrinaker erikgrinaker left a comment

LGTM, thanks for adding this!

@@ -1418,6 +1418,12 @@ cache will already have moved on to newer entries.
 		Measurement: "Bytes",
 		Unit:        metric.Unit_BYTES,
 	}
+	metaRaftStorageError = metric.Metadata{
+		Name: "raft.storage.error",
+		Help: "Number of calls to the raft.Storage API that returned an error",

nit: the "raft.Storage API" part seems a bit too low-level for users, consider perhaps "Number of Raft storage errors".

@sumeerbhola sumeerbhola left a comment

TFTR!

pkg/kv/kvserver/metrics.go line 1423 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: the "raft.Storage API" part seems a bit too low-level for users, consider perhaps "Number of Raft storage errors".

Done

@sumeerbhola

bors r=erikgrinaker

@craig
craig bot commented Oct 31, 2023

Build failed:

@sumeerbhola

bors r=erikgrinaker

@craig
craig bot commented Nov 6, 2023

Build succeeded:

@craig craig bot merged commit 4a53d0b into cockroachdb:master Nov 6, 2023
3 checks passed