storage: Improved metrics #8645

rjnn · 2016-08-18T15:54:58Z

The storage layer (specifically, each individual store) suffers from a lack of metrics. To kick this off, #8257 introduced metrics for tracking time spent by each store in processRaft. We also decided to add metrics on heartbeats sent and received by each store ahead of addressing #6107. This is an issue to collect ideas and suggestions about what other metrics folks might find helpful in debugging (cc @cockroachdb/stability, @cuongdo, @mberhault). Here are some initial metrics that I am planning to add in advance of #6107:

Number of heartbeat messages sent by all replicas per second.
Number of heartbeat messages received by all replicas per second.
Number of elections participated in by all replicas per second.
Number of snapshots received by all replicas per second.
Number of pre-emptive snapshots received per second.

If there are other metrics that would be useful to you, or if there is a specific format you would prefer, let's track that in this issue.


petermattis · 2016-08-18T15:56:55Z

We already have metrics for normal and preemptive snapshots applied as well as a metric for the number of snapshots generated.

bdarnell · 2016-08-18T16:57:04Z

Peter and I were just talking about recording the number of ticks per second. It should be constant, but it will tell us when we're skipping ticks because processRaft is blocked for longer than the tick interval.

rjnn · 2016-08-18T16:58:30Z

We do have metrics that are being tracked for normal and pre-emptive snapshots, but they currently aren't graphed or logged (as far as I can tell). I think they should be exposed in the "advanced internals" section of the node graphs. Ticks is a good idea as well.

mberhault · 2016-08-18T17:04:02Z

FYI: all metrics should be raw counts of things, not rates.
e.g. "total heartbeats" vs "heartbeats per second"

pre-emptive snapshots are on grafana as of earlier this morning: http://monitoring.gce.cockroachdb.com:3000/dashboard/db/cockroachinternals

cuongdo · 2016-08-18T17:14:16Z

Number of ticks per second, or the inverse (seconds per tick), would be my vote for the top-priority metric.

It's good to have snapshot metrics in the admin UI for transient test clusters.

The number of elections (or term changes) is a good one to track. It's a partial proxy for other issues, such as dropped Raft messages and lease holder instability.

spencerkimball · 2016-08-18T19:45:18Z

Ticks
Raft messages sent
Raft messages received
Raft message drop count
Raft transport queue full count
Nanoseconds spent in processRaft loop
Excess nanoseconds spent in processRaft loop (excess is defined as amount more than tick interval)

rjnn · 2016-08-18T19:52:04Z

> Raft messages sent

All messages?

> Nanoseconds spent in processRaft loop

#8257 already introduced this. Is there something missing that you would like added?

> pre-emptive snapshots are on grafana as of earlier this morning:

Is there an underlying rationale for why different things are graphed in different places? Some things are in the cockroach admin UI, some things are in grafana (which is currently not publicly available), and some things are in both. My working belief is that all these metrics should be graphed in the admin UI, since grafana needs to be independently deployed, and these metrics are most useful for debugging changes locally. The admin UI is the lowest-overhead way to achieve that. Either way, anyone who wants to read the raw metrics log can also do so from _status/vars (for example, from the gamma cluster).

bdarnell · 2016-08-18T19:56:13Z

We've been focusing on grafana recently because it's quicker to iterate on, can show more history, and works even when cockroachdb is having problems. Once we've settled on the metrics that are useful to graph we can sync those back to the admin UI.

spencerkimball · 2016-08-18T19:59:02Z

@arjunravinarayan yes, I'd like to see the total volume of traffic that Raft is processing. Byte counts too, unless those are available somewhere else.

Add additional metrics for Raft messages to help in debugging: total messages sent and received, transport queue length, dropped messages, and ticks. Expose these metrics in the Admin UI, under the "Advanced Internals" section. Closes cockroachdb#8645.

rjnn added the question label Aug 18, 2016

rjnn added this to the Q3 milestone Aug 18, 2016

rjnn self-assigned this Aug 18, 2016

mberhault added the A-monitoring label Aug 18, 2016

rjnn mentioned this issue Aug 24, 2016

storage, ui: add metrics for raft messages #8803

Merged

tbg closed this as completed in 396c733 Aug 26, 2016
