
workload: add prometheus metrics #66313

Merged: 3 commits merged into cockroachdb:master on Jun 10, 2021

Conversation

@tbg (Member) commented Jun 10, 2021

  • workload/tpcc: expose tpc-c ops and duration as prometheus metrics
  • workload: un-singletonize metrics
  • histogram: adjust prometheus histogram bucket sizes

Opening this here since I wasn't able to push to #66224 directly.

otan and others added 3 commits June 10, 2021 16:00
This commit adds prometheus metrics to the workload binary, exposing them
on port 2112 by default (configurable via a CLI flag).

We also add some sample metrics to test the machinery:
* all histogram metrics are automatically added
* success and failure rate of tpc-c operations

Release note: None
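
A minimal sketch of the shape of this change (flag name and wiring are illustrative, not the exact code in this PR): the workload process serves whatever is registered with the default Prometheus registry over HTTP on a configurable port.

```go
// Minimal sketch, not the actual workload wiring: serve Prometheus metrics
// over HTTP on a flag-configurable port, defaulting to 2112.
package main

import (
	"flag"
	"fmt"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Hypothetical flag name; the real CLI flag added by this PR may differ.
	port := flag.Int("prometheus-port", 2112, "port for the Prometheus /metrics endpoint")
	flag.Parse()

	// promhttp.Handler serves everything registered with the default registry.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(fmt.Sprintf(":%d", *port), nil))
}
```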
@tbg tbg requested a review from a team as a code owner June 10, 2021 12:44
@tbg tbg requested review from a team June 10, 2021 12:44
@cockroach-teamcity (Member)

This change is Reviewable

@otan (Contributor) left a comment

yeah for some reason i can't let external people contribute to my fork (it just doesn't show up as an option for me).

i like the registry cleanup!

this LGTM, i'll bors it!

bors r+

@tbg (Member, Author) commented Jun 10, 2021 via email

@tbg (Member, Author) commented Jun 10, 2021

cc @cockroachdb/test-eng for completeness.

@joshimhoff (Collaborator) commented Jun 10, 2021

I am looking at #66224 (comment). Sorry for last minute; if my comments don't seem worth addressing, ignore em.

To me what is idiomatic is:

  1. A request count counter.
  2. An error count counter.
  3. A latency histogram.

If you do successes & errors, the resulting error rate query is a bit more complex to compute, since you need to add successes and errors to get the total. Small potatoes, but I'd vote we do it one way consistently at least moving forward, and I'd propose the above as the most idiomatic layout.
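
Something like this, using prometheus/client_golang (metric names here are illustrative, not the ones this PR adds):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// 1. Request count, regardless of outcome.
	newOrderTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "tpcc_new_order_total",
		Help: "Total number of newOrder transactions attempted.",
	})
	// 2. Error count.
	newOrderErrors = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "tpcc_new_order_errors_total",
		Help: "Number of newOrder transactions that failed.",
	})
	// 3. Latency histogram for successful transactions only.
	newOrderDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "tpcc_new_order_duration_seconds",
		Help:    "Latency of successful newOrder transactions.",
		Buckets: prometheus.DefBuckets,
	})
)

func init() {
	prometheus.MustRegister(newOrderTotal, newOrderErrors, newOrderDuration)
}

// With this layout the error rate is a single division at query time:
//
//	rate(tpcc_new_order_errors_total[1m]) / rate(tpcc_new_order_total[1m])
//
// whereas with success/error counters the denominator first has to be
// reassembled by summing the two series.
```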

I like separate request counters + latency histograms (referencing the discussion about using the latency histogram as a counter of total requests instead of having a separate counter for that), since a common pattern is to not measure the latency of errors with the latency histogram, but sometimes people do measure the latency of errors too. It is more explicit to rely on a request count and error count to measure error rate IMO, given that it's often unclear just from the metric name what choice people made re: errors & latency histograms.

Also, do we want metrics that measure throughput? Well, I guess the counters do that? But often when I hear people talking about TPC-C I hear em talking in terms of higher level concepts like warehouses?

@craig (bot, Contributor) commented Jun 10, 2021

Build succeeded:

@craig craig bot merged commit 9eb1b96 into cockroachdb:master Jun 10, 2021
@tbg tbg deleted the workload_expose branch June 10, 2021 15:27
@tbg (Member, Author) commented Jun 10, 2021

You're saying you would prefer a histogram that covers txns regardless of the outcome? That makes sense to me and would be my preference as well. I have to figure out if we can do that though - these histograms are basically our performance data. We run tpccbench under chaos, where errors are expected; they're currently not taken into account to compute whether a TPCC run passed. I have a feeling though that the result of these chaos runs is pretty much meaningless anyway, and that we only care about the performance data in cases where the workloads fail on errors, in which case including them in the histogram is fine (since the result won't count if it contains an error). I will check on that and then make the change you suggested.

@tbg (Member, Author) commented Jun 10, 2021

Might've read you wrong (probably did!)

You're saying

  • a request count counter (i.e. success or error, doesn't matter)
  • error counter
  • latency histogram only for successful requests

Is that correct?

re: measuring throughput, I thought it was idiomatic to have that computed from the counters (rather than trying to expose rates as a metric). TPCC defines some higher-level concepts such as (I think) "the number of newOrder transactions per minute", but again that is just a throughput. There's also something derived from that, but it's basically just a bit of multiplication.
The warehouse count is a property of the TPCC run. It could be exposed as a metric (gauge) if it entered the computations anywhere (it may or may not, I would have to check).

Something that I think we should be able to express is "what's the TpmC over the last minute" and to set up alerting on that.
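
Roughly, with a plain counter (name again illustrative), that becomes a query-time computation rather than a new metric:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Illustrative counter, bumped once per committed newOrder transaction.
var newOrderCommitted = promauto.NewCounter(prometheus.CounterOpts{
	Name: "tpcc_new_order_committed_total",
	Help: "Number of committed newOrder transactions.",
})

// recordNewOrderCommit is called after each successful newOrder commit.
// TpmC over the trailing minute is then derived at query time, e.g.
//
//	rate(tpcc_new_order_committed_total[1m]) * 60
//
// which can also be used directly as an alerting expression.
func recordNewOrderCommit() {
	newOrderCommitted.Inc()
}
```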

@ajwerner (Contributor)

FWIW the latency histograms also embed counters so if the latency does not include errors that allows you to get count without errors.

A histogram with a base metric name of <basename> exposes multiple time series during a scrape:

  • cumulative counters for the observation buckets, exposed as <basename>_bucket{le="<upper inclusive bound>"}
  • the total sum of all observed values, exposed as <basename>_sum
  • the count of events that have been observed, exposed as <basename>_count (identical to <basename>_bucket{le="+Inf"} above)

https://prometheus.io/docs/concepts/metric_types/#histogram
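
Concretely, a small sketch (names illustrative): a single Observe call populates all three kinds of series above, so _count can double as a counter of whatever gets observed into the histogram.

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Illustrative histogram; if only successful transactions are observed, then
// tpcc_new_order_duration_seconds_count is effectively a success counter.
var newOrderDuration = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "tpcc_new_order_duration_seconds",
	Help:    "Latency of successful newOrder transactions.",
	Buckets: prometheus.DefBuckets,
})

func observeNewOrder(d time.Duration) {
	// One Observe call updates _bucket{le="..."}, _sum, and _count.
	newOrderDuration.Observe(d.Seconds())
}
```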

@joshimhoff (Collaborator) commented Jun 10, 2021

  1. a request count counter (i.e. success or error, doesn't matter)
  2. error counter
  3. latency histogram only for successful requests

Yes, this is my preference. But partly it is just a matter of agreeing on a standard way to lay out the data, and I think the above is a good standard. Maybe we need a metrics style guide.

FWIW the latency histograms also embed counters so if the latency does not include errors that allows you to get count without errors.

Yup. My argument for a separate counter is that, looking at the latency histogram name as an operator, it is often NOT clear if it includes errors or not. I'm +1 on NOT including errors in latency histograms. But I think a more explicit way to measure error rate (and thus less likely to lead to alerting regressions) is by dividing an error counter by a total request counter, instead of depending on the histogram count, which may or may not include errors (I bet both are done in the CRDB codebase today).

re: measuring throughput, I thought it was idiomatic to have that computed from the counters (rather than trying to expose rates as a metric).

Ya I think you're right. Thanks for explaining.
