
workload: add prometheus metrics #66313

Merged: 3 commits merged into cockroachdb:master on Jun 10, 2021

Conversation

@tbg (Member) commented Jun 10, 2021

  • workload/tpcc: expose tpc-c ops and duration as prometheus metrics
  • workload: un-singletonize metrics
  • histogram: adjust prometheus histogram bucket sizes

Opening this here since I wasn't able to push to #66224 directly.

otan and others added 3 commits June 10, 2021 16:00
This commit adds prometheus metrics to the workload binary, exposing them
on port 2112 by default (configurable via a CLI flag).

We also add some sample metrics to test the machinery:
* all histogram metrics are automatically added
* success and failure rate of tpc-c operations

Release note: None
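
A minimal sketch of the shape of this change (flag name and wiring are illustrative, not the exact code in this PR): the workload process serves whatever is registered with the default Prometheus registry over HTTP on a configurable port.

```go
// Minimal sketch, not the actual workload wiring: serve Prometheus metrics
// over HTTP on a flag-configurable port, defaulting to 2112.
package main

import (
	"flag"
	"fmt"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Hypothetical flag name; the real CLI flag added by this PR may differ.
	port := flag.Int("prometheus-port", 2112, "port for the Prometheus /metrics endpoint")
	flag.Parse()

	// promhttp.Handler serves everything registered with the default registry.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(fmt.Sprintf(":%d", *port), nil))
}
```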
@tbg tbg requested a review from a team as a code owner June 10, 2021 12:44
@tbg tbg requested review from a team June 10, 2021 12:44
@cockroach-teamcity (Member)

This change is Reviewable

@otan (Contributor) left a comment

yeah for some reason i can't let external people contribute to my fork (it just doesn't show up as an option for me).

i like the registry cleanup!

this LGTM, i'll bors it!

bors r+

@tbg (Member, Author) commented Jun 10, 2021 via email

@tbg (Member, Author) commented Jun 10, 2021

cc @cockroachdb/test-eng for completeness.

@joshimhoff (Collaborator) commented Jun 10, 2021

I am looking at #66224 (comment). Sorry for last minute; if my comments don't seem worth addressing, ignore em.

To me what is idiomatic is:

  1. A request count counter.
  2. An error count counter.
  3. A latency histogram.

If you do successes & errors, the resulting error rate query is a bit more complex to compute, since you need to add successes and errors to get the total. Small potatoes, but I'd vote we do it one way consistently at least moving forward, and I'd propose the above as the most idiomatic layout.
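
Something like this, using prometheus/client_golang (metric names here are illustrative, not the ones this PR adds):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// 1. Request count, regardless of outcome.
	newOrderTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "tpcc_new_order_total",
		Help: "Total number of newOrder transactions attempted.",
	})
	// 2. Error count.
	newOrderErrors = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "tpcc_new_order_errors_total",
		Help: "Number of newOrder transactions that failed.",
	})
	// 3. Latency histogram for successful transactions only.
	newOrderDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "tpcc_new_order_duration_seconds",
		Help:    "Latency of successful newOrder transactions.",
		Buckets: prometheus.DefBuckets,
	})
)

func init() {
	prometheus.MustRegister(newOrderTotal, newOrderErrors, newOrderDuration)
}

// With this layout the error rate is a single division at query time:
//
//	rate(tpcc_new_order_errors_total[1m]) / rate(tpcc_new_order_total[1m])
//
// whereas with success/error counters the denominator first has to be
// reassembled by summing the two series.
```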

I like separate request counters + latency histograms (referencing the discussion about using the latency histogram as a counter of total requests instead of having a separate counter for that), since a common pattern is to not measure the latency of errors with the latency histogram, but sometimes people do measure the latency of errors too. It is more explicit to rely on a request count and error count to measure error rate IMO, given that it's often unclear just from the metric name what choice people made re: errors & latency histograms.

Also, do we want metrics that measure throughput? Well, I guess the counters do that? But often when I hear people talking about TPC-C I hear em talking in terms of higher level concepts like warehouses?

@craig (bot, Contributor) commented Jun 10, 2021

Build succeeded:

@craig craig bot merged commit 9eb1b96 into cockroachdb:master Jun 10, 2021
@tbg tbg deleted the workload_expose branch June 10, 2021 15:27
@tbg (Member, Author) commented Jun 10, 2021

You're saying you would prefer a histogram that covers txns regardless of the outcome? That makes sense to me and would be my preference as well. I have to figure out if we can do that though - these histograms are basically our performance data. We run tpccbench under chaos, where errors are expected; they're currently not taken into account to compute whether a TPCC run passed. I have a feeling though that the result of these chaos runs is pretty much meaningless anyway, and that we only care about the performance data in cases where the workloads fail on errors, in which case including them in the histogram is fine (since the result won't count if it contains an error). I will check on that and then make the change you suggested.

@tbg (Member, Author) commented Jun 10, 2021

Might've read you wrong (probably did!)

You're saying

  • a request count counter (i.e. success or error, doesn't matter)
  • error counter
  • latency histogram only for successful requests

Is that correct?

re: measuring throughput, I thought it was idiomatic to have that computed from the counters (rather than trying to expose rates as a metric). TPCC defines some higher-level concepts such as (I think) "the number of newOrder transactions per minute", but again that is just a throughput. There's also something derived from that, but it's basically just a bit of multiplication.
The warehouse count is a property of the TPCC run. It could be exposed as a metric (gauge) if it entered the computations anywhere (it may or may not, I would have to check).

Something that I think we should be able to express is "what's the TpmC over the last minute" and to set up alerting on that.
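
Roughly, with a plain counter (name again illustrative), that becomes a query-time computation rather than a new metric:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Illustrative counter, bumped once per committed newOrder transaction.
var newOrderCommitted = promauto.NewCounter(prometheus.CounterOpts{
	Name: "tpcc_new_order_committed_total",
	Help: "Number of committed newOrder transactions.",
})

// recordNewOrderCommit is called after each successful newOrder commit.
// TpmC over the trailing minute is then derived at query time, e.g.
//
//	rate(tpcc_new_order_committed_total[1m]) * 60
//
// which can also be used directly as an alerting expression.
func recordNewOrderCommit() {
	newOrderCommitted.Inc()
}
```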

@ajwerner (Contributor)

FWIW the latency histograms also embed counters so if the latency does not include errors that allows you to get count without errors.

A histogram with a base metric name of <basename> exposes multiple time series during a scrape:

  • cumulative counters for the observation buckets, exposed as <basename>_bucket{le="<upper inclusive bound>"}
  • the total sum of all observed values, exposed as <basename>_sum
  • the count of events that have been observed, exposed as <basename>_count (identical to <basename>_bucket{le="+Inf"} above)

https://prometheus.io/docs/concepts/metric_types/#histogram
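
Concretely, a small sketch (names illustrative): a single Observe call populates all three kinds of series above, so _count can double as a counter of whatever gets observed into the histogram.

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Illustrative histogram; if only successful transactions are observed, then
// tpcc_new_order_duration_seconds_count is effectively a success counter.
var newOrderDuration = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "tpcc_new_order_duration_seconds",
	Help:    "Latency of successful newOrder transactions.",
	Buckets: prometheus.DefBuckets,
})

func observeNewOrder(d time.Duration) {
	// One Observe call updates _bucket{le="..."}, _sum, and _count.
	newOrderDuration.Observe(d.Seconds())
}
```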

@joshimhoff (Collaborator) commented Jun 10, 2021

  1. a request count counter (i.e. success or error, doesn't matter)
  2. error counter
  3. latency histogram only for successful requests

Yes, this is my preference. But partly it is just a matter of agreeing on a standard way to lay out the data, and I think the above is a good standard. Maybe we need a metrics style guide.

FWIW the latency histograms also embed counters so if the latency does not include errors that allows you to get count without errors.

Yup. My argument for a separate counter is that, looking at the latency histogram name as an operator, it is often NOT clear if it includes errors or not. I'm +1 on NOT including errors in latency histograms. But I think a more explicit way to measure error rate (and thus less likely to lead to alerting regressions) is by dividing an error counter by a total request counter, instead of depending on the histogram count, which may or may not include errors (I bet both are done in the CRDB codebase today).

re: measuring throughput, I thought it was idiomatic to have that computed from the counters (rather than trying to expose rates as a metric).

Ya I think you're right. Thanks for explaining.
