storage: Improved metrics #8645

Closed
rjnn opened this issue Aug 18, 2016 · 9 comments

@rjnn
Contributor

rjnn commented Aug 18, 2016

The storage layer (specifically, each individual store) suffers from a lack of metrics. To kick this off, #8257 introduced metrics for tracking time spent by each store in processRaft. We also decided to add metrics on heartbeats sent and received by each store ahead of addressing #6107. This issue collects ideas and suggestions about what other metrics folks might find helpful in debugging (cc @cockroachdb/stability, @cuongdo, @mberhault). Here are some initial metrics that I am planning to add in advance of #6107:

  • Number of heartbeat messages sent by all replicas per second.
  • Number of heartbeat messages received by all replicas per second.
  • Number of elections participated in by all replicas per second.
  • Number of snapshots received by all replicas per second.
  • Number of pre-emptive snapshots received per second.

If there are other metrics that would be useful to you, or if there is a specific format you would prefer, let's track that in this issue.

@rjnn rjnn added the question label Aug 18, 2016
@rjnn rjnn added this to the Q3 milestone Aug 18, 2016
@rjnn rjnn self-assigned this Aug 18, 2016
@petermattis
Collaborator

We already have metrics for normal and preemptive snapshots applied as well as a metric for the number of snapshots generated.

@bdarnell
Contributor

Peter and I were just talking about recording the number of ticks per second (it should be constant, but it will tell us when we're skipping ticks because processRaft is getting blocked for longer than the tick interval).

@rjnn
Contributor Author

rjnn commented Aug 18, 2016

We do have metrics that are being tracked for normal and pre-emptive snapshots, but they currently aren't graphed or logged (as far as I can tell). I think they should be exposed in the "advanced internals" section of the node graphs. Ticks is a good idea as well.

@mberhault
Contributor

FYI: all metrics should be raw counts of things, not rates.
e.g. "total heartbeats" vs "heartbeats per second"

pre-emptive snapshots are on grafana as of earlier this morning: http://monitoring.gce.cockroachdb.com:3000/dashboard/db/cockroachinternals

@cuongdo
Contributor

cuongdo commented Aug 18, 2016

Number of ticks per second or the inverse, seconds per tick, would be my vote for top-priority metric.

It's good to have snapshot metrics in the admin UI for transient test clusters.

Number of elections (or term changes) is a good one to track. It's a partial proxy for other issues, such as dropped Raft messages and lease holder instability.

@spencerkimball
Member

  • Ticks
  • Raft messages sent
  • Raft messages received
  • Raft message drop count
  • Raft transport queue full count
  • Nanoseconds spent in processRaft loop
  • Excess nanoseconds spent in processRaft loop (excess is defined as amount more than tick interval)

@rjnn
Contributor Author

rjnn commented Aug 18, 2016

> Raft messages sent

All messages?

> Nanoseconds spent in processRaft loop

#8257 already introduced this. Is there something missing that you would like added?

> pre-emptive snapshots are on grafana as of earlier this morning:

Is there some underlying rationale for why different things are graphed in different places? Some things are in the cockroach admin UI, some things are in grafana (which is currently not publicly available), and some things are in both. My working belief is that all these metrics should be graphed in the admin UI, since grafana needs to be independently deployed, and these metrics are most useful for debugging changes locally. The admin UI is the lowest-overhead way to achieve that. In either case, anyone who wants to read the raw metrics can also do so from _status/vars (for example, from the gamma cluster).
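
As a rough sketch of reading raw metrics, the snippet below parses text of the form `name value`, one metric per line, with `#` comment lines. That format is an assumption about what `_status/vars` returns, and `parseVars` is a hypothetical helper. It operates on a string, so fetching the endpoint is left out, and it does not handle label values that contain spaces.

```go
package main

import (
	"bufio"
	"strconv"
	"strings"
)

// parseVars extracts "name value" pairs from metrics text.
// Blank lines, lines starting with '#', and lines whose second field is
// not a number are skipped. The input format is an assumption.
func parseVars(text string) map[string]float64 {
	out := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			continue
		}
		out[fields[0]] = v
	}
	return out
}
```

Since the values are raw counts, two snapshots taken some time apart are enough to compute a rate by hand.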

@bdarnell
Contributor

We've been focusing on grafana recently because it's quicker to iterate on, can show more history, and works even when cockroachdb is having problems. Once we've settled on the metrics that are useful to graph we can sync those back to the admin UI.

@spencerkimball
Member

@arjunravinarayan yes I'd like to see the total volume of traffic which raft is processing. Byte counts too unless those are available somewhere else.

rjnn pushed a commit to rjnn/cockroach that referenced this issue Aug 24, 2016
Add additional metrics for Raft messages to help in debugging: total
messages sent and received, transport queue length, dropped messages,
and ticks. Expose these metrics in the Admin UI, under the "Advanced
Internals" section. Closes cockroachdb#8645.
@tbg tbg closed this as completed in 396c733 Aug 26, 2016