workload: add prometheus metrics #66313
Conversation
This commit adds Prometheus metrics to the workload binary, exposing them on port 2112 by default (configurable by CLI flag). We also add some sample metrics to test the machinery:

* all histogram metrics are automatically added
* success and failure rate of tpc-c operations

Release note: None
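As a rough sketch of the wiring described above (not the PR's actual code; the flag name is invented), assuming the standard client_golang library:

```go
package main

import (
	"flag"
	"fmt"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// "prometheus-port" is a hypothetical flag name; the PR description only
// says the port defaults to 2112 and is configurable via a CLI flag.
var prometheusPort = flag.Int("prometheus-port", 2112,
	"port on which to expose prometheus metrics")

func main() {
	flag.Parse()
	// Expose everything in the default registry (which would include the
	// automatically added histogram metrics) on /metrics.
	http.Handle("/metrics", promhttp.Handler())
	go func() {
		_ = http.ListenAndServe(fmt.Sprintf(":%d", *prometheusPort), nil)
	}()
	// ... run the workload here ...
	select {} // block forever in this sketch
}
```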
yeah for some reason i can't let external people contribute to my fork (it just doesn't show up as an option for me).
i like the registry cleanup!
this LGTM, i'll bors it!
bors r+
Weird! I just looked at the docs around this and it didn't look like anything you could opt out of generically, curious what's different about your fork or your PR workflow. I assume nothing...
cc @cockroachdb/test-eng for completeness.
I am looking at #66224 (comment). Sorry for the last-minute comments; if they don't seem worth addressing, ignore 'em. To me what is idiomatic is a total request counter plus an error counter, so error rate is simply errors divided by total.
If you do successes & errors, the resulting error rate query is a bit more complex to compute, as you need to add successes and errors to get the total. Small potatoes, but I'd vote we do it one way consistently at least moving forward, and I'd propose the above way as the most idiomatic.

I like separate request counters + latency histograms (referencing the discussion about using the latency histogram as a counter of total requests instead of having a separate counter for that). A common pattern is to not measure the latency of errors with the latency histogram, but sometimes people do measure the latency of errors. It is more explicit to rely on a request count and an error count to measure error rate IMO, given that it's often unclear just from the metric name which choice people made re: errors & latency histograms; see the sketch below.

Also, do we want metrics that measure throughput? Well, I guess the counters do that? But often when I hear people talking about TPC-C I hear 'em talking in terms of higher-level concepts like warehouses.
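To make the proposed layout concrete, here is a minimal sketch assuming the standard client_golang API; the metric names are invented for illustration, and the resulting PromQL is sketched in the trailing comment:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Metric names here are hypothetical, for illustration only.
var (
	// One counter for all attempts, regardless of outcome.
	newOrderTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "workload_tpcc_new_order_total",
		Help: "newOrder transactions attempted.",
	})
	// One counter for errors only.
	newOrderErrors = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "workload_tpcc_new_order_errors_total",
		Help: "newOrder transactions that failed.",
	})
	// Latency histogram; whether it includes errors is the point under
	// discussion above.
	newOrderLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "workload_tpcc_new_order_duration_seconds",
		Help: "Latency of newOrder transactions.",
	})
)

func init() {
	prometheus.MustRegister(newOrderTotal, newOrderErrors, newOrderLatency)
}

// With a total counter and an error counter, the error-rate query is a
// simple ratio:
//
//	rate(workload_tpcc_new_order_errors_total[1m])
//	  / rate(workload_tpcc_new_order_total[1m])
//
// With separate success and error counters, the denominator needs a sum:
//
//	rate(errors[1m]) / (rate(successes[1m]) + rate(errors[1m]))
```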
Build succeeded.
You're saying you would prefer a histogram that covers txns regardless of the outcome? That makes sense to me and would be my preference as well. I have to figure out if we can do that, though: these histograms are basically our performance data. We run tpccbench under chaos, where errors are expected; they're currently not taken into account to compute whether a TPCC run passed. I have a feeling, though, that the result of these chaos runs is pretty much meaningless anyway, and that we only care about the performance data in cases where the workloads fail on errors, in which case including them in the histogram is fine (since the result won't count if it contains an error). I will check on that and then make the change you suggested.
Might've read you wrong (probably did!). You're saying you'd prefer a total request counter plus an error counter, with the latency histogram kept separate. Is that correct?

re: measuring throughput, I thought it was idiomatic to have that computed from the counters (rather than trying to expose rates as a metric). TPCC defines some higher-level concepts such as (I think) "the number of newOrder transactions per minute"; again, that is just a throughput. There's also something derived from that, but it's basically just a bit of multiplication. Something that I think we should be able to express is "what's the TpmC over the last minute", and to set up alerting on that.
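Concretely, throughput then falls out of a rate query over the counter rather than a dedicated metric; sketched as a Go comment here, since the computation lives in Prometheus, not the workload binary (metric name as in the hypothetical sketch above):

```go
// newOrder transactions per minute, over the last minute, derived from the
// hypothetical counter above; this is the kind of expression one could
// alert on for a TpmC-style threshold:
//
//	rate(workload_tpcc_new_order_total[1m]) * 60
```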
FWIW, the latency histograms also embed counters, so if the latency does not include errors, that gives you a count without errors.
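For context on the embedded counters: a Prometheus histogram automatically exposes companion `_count` and `_sum` series alongside its buckets, e.g. for the hypothetical histogram above:

```go
// Series exposed alongside the histogram buckets:
//
//	workload_tpcc_new_order_duration_seconds_count // number of observations
//	workload_tpcc_new_order_duration_seconds_sum   // sum of observed seconds
```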
Yes, this is my preference. But partly it is just a matter of agreeing on a standard way to lay out the data, and I think the above is a good standard. Maybe we need a metrics style guide.
Yup. My argument for a separate counter is that, looking at the latency histogram name as an operator, it is often NOT clear if it includes errors or not. I'm +1 on NOT including errors in latency histograms. But I think a more explicit way to measure error rate (and thus less likely to lead to alerting regressions) is by dividing an error counter by a total request counter, instead of depending on the histogram count, which may or may not include errors (I bet both are done in the CRDB codebase today).
Ya I think you're right. Thanks for explaining.
Opening this here since I wasn't able to push to #66224 directly.