Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: add IOThreshold to metrics #87424

Closed
tbg opened this issue Sep 6, 2022 · 3 comments · Fixed by #87656
Closed

kvserver: add IOThreshold to metrics #87424

tbg opened this issue Sep 6, 2022 · 3 comments · Fixed by #87656
Assignees
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) E-starter Might be suitable for a starter project for new employees or team members.

Comments

@tbg
Copy link
Member

tbg commented Sep 6, 2022

We should expose a store's IOThreshold as a store metric.

This is straightforward.

Noticed during #86775 that SRE is alerting on a naked FileCount.

Jira issue: CRDB-19342

@tbg tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv-replication labels Sep 6, 2022
@blathers-crl
Copy link

blathers-crl bot commented Sep 6, 2022

cc @cockroachdb/replication

@kvoli kvoli added the E-starter Might be suitable for a starter project for new employees or team members. label Sep 6, 2022
KaiSun314 added a commit to KaiSun314/cockroach that referenced this issue Sep 8, 2022
Resolves cockroachdb#87424.

Previously, only the unnormalized values of the LSM L0 sub-level and
file counts is exposed externally, not the store's IOThreshold.

This was inadequate because it is tedious to normalize and compare the
LSM L0 sub-level and file counts (as they require dividing by different
numbers).

To address this, this patch adds a metric `admission.io.overload`
tracking the store's IOThreshold.

Release note (ops change): Added a metric `admission.io.overload` which
tracks the store's IOThreshold.
@tbg
Copy link
Member Author

tbg commented Sep 9, 2022

cc @joshimhoff want to track this as it closes since I think you'll want to add alerting on this new metric as soon as it becomes available.

@kvoli planning backport to 22.2 or not?

@kvoli
Copy link
Collaborator

kvoli commented Sep 9, 2022

I think a backport to 22.2 would be a good idea. I'll work with @KaiSun314 after it's merged to backport.

KaiSun314 added a commit to KaiSun314/cockroach that referenced this issue Sep 12, 2022
Resolves cockroachdb#87424.

Previously, only the unnormalized values of the LSM L0 sub-level and
file counts is exposed externally, not the store's IOThreshold.

This was inadequate because it is tedious to normalize and compare the
LSM L0 sub-level and file counts (as they require dividing by different
numbers).

To address this, this patch adds a metric `admission.io.overload`
tracking the store's IOThreshold.

Release note (ops change): Added a metric `admission.io.overload` which
tracks the store's IOThreshold.
@KaiSun314 KaiSun314 self-assigned this Sep 15, 2022
craig bot pushed a commit that referenced this issue Sep 19, 2022
87656: kvserver: add `admission.io.overload` metric r=kvoli a=KaiSun314

Resolves #87424.

Previously, only the unnormalized values of the LSM L0 sub-level and file counts is exposed externally, not the store's IOThreshold.

This was inadequate because it is tedious to normalize and compare the LSM L0 sub-level and file counts (as they require dividing by different numbers).

To address this, this patch adds a metric `admission.io.overload` tracking the store's IOThreshold.

Release note (ops change): Added a metric `admission.io.overload` which tracks the store's IOThreshold.

Note: I am not certain on the placement of the `admission.io.overload` metric, but it seems reasonable.
Justification:
1. Pretty much all updates to store metrics outside of ComputeMetrics in pkg/kv/kvstore/store.go seem to be increments (e.g. Inc(1) )
2. ComputeMetrics (there is a comment in the code: ComputeMetrics immediately computes the current value of store metrics which cannot be computed incrementally) updates the store metrics via updateCapacityGauges , updateReplicationGauges , updateEngineMetrics , and updateEnvStats. Looking at all these functions, updateReplicationGauges (where we are currently updating the IOOverload metric) seems to make the most sense to update the IOOverload metric
3. There seem to be two other admission metrics:
- RaftPausedFollowerCount - updated in updateReplicationGauges
- RaftPausedFollowerDroppedMsgs.Inc - updated in pkg/kv/kvserver/replica_raft.go via r.store.Metrics().RaftPausedFollowerDroppedMsgs.Inc(1) 

Co-authored-by: Kai Sun <[email protected]>
@craig craig bot closed this as completed in d5571a2 Sep 19, 2022
KaiSun314 added a commit to KaiSun314/cockroach that referenced this issue Sep 19, 2022
Resolves cockroachdb#87424.

Previously, only the unnormalized values of the LSM L0 sub-level and
file counts is exposed externally, not the store's IOThreshold.

This was inadequate because it is tedious to normalize and compare the
LSM L0 sub-level and file counts (as they require dividing by different
numbers).

To address this, this patch adds a metric `admission.io.overload`
tracking the store's IOThreshold.

Release note (ops change): Added a metric `admission.io.overload` which
tracks the store's IOThreshold.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) E-starter Might be suitable for a starter project for new employees or team members.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants