workload/tpcc: expose tpc-c ops and duration as prometheus metrics #66224
Conversation
Force-pushed 14fd8aa to b653666
Reviewed 9 of 9 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)
Thank you for getting this started! I have a few suggestions that I think we should get to. Broadly, these are about avoiding singletons and integrating the new histograms with the existing histograms. By the end of this, no particular workload should need to define prometheus histograms explicitly; they should exist as a byproduct of the calls to `hist.Get()` that are already present in the workloads (but workloads can go wild with additional pure prometheus metrics where desired).
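To make that idea concrete, here is a rough sketch of the Get/Record call pattern being described, assuming a lazily-populating registry; the `histograms`, `Get`, and `Record` names are stand-ins for whatever pkg/workload/histogram actually exposes, not verified signatures:

```go
package main

import (
	"fmt"
	"time"
)

// namedHistogram stands in for whatever pkg/workload/histogram returns from
// Get; Record would tee into the hdrhistogram and (per the proposal) a
// prometheus histogram.
type namedHistogram struct{ name string }

func (h *namedHistogram) Record(elapsed time.Duration) {
	fmt.Printf("%s: %v\n", h.name, elapsed)
}

// histograms lazily creates one handle per name, so a prometheus histogram
// could be created here as a byproduct of the first Get call.
type histograms struct{ m map[string]*namedHistogram }

func (h *histograms) Get(name string) *namedHistogram {
	if h.m == nil {
		h.m = map[string]*namedHistogram{}
	}
	if nh, ok := h.m[name]; ok {
		return nh
	}
	nh := &namedHistogram{name: name}
	h.m[name] = nh
	return nh
}

func main() {
	var hists histograms
	start := time.Now()
	// ... run the newOrder transaction here ...
	hists.Get("newOrder").Record(time.Since(start))
}
```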
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @otan)
pkg/workload/tpcc/worker.go, line 52 at r1 (raw file):
// successCounter measures the number of successes for the given tx.
successCounter prometheus.Counter
I don't think you need that, since you have
pkg/workload/tpcc/worker.go, line 94 at r1 (raw file):
func init() {
	for i, tx := range allTxs {
		allTxs[i].successCounter = prometheusMetrics.NewCounter(
This is too much singleton - `allTxs` is shared because it's all stateless, but now we're essentially adding state to it. We shouldn't do that; it is perfectly reasonable to want to invoke multiple copies of a workload in a single test and get dedicated stats for each. Maybe this doesn't work now for other reasons (we have a unit test for this workload, so probably it does or at least almost should), but we should move things in the right direction. I think this would naturally fall out of my other comment below about unifying both versions of histograms we have here now.
It's also worth thinking about how we'll distinguish metrics from different workloads all scraped by the same prometheus instance. `kv` and `bank` might both end up with a metric named `op_success`; how do they get disambiguated? Perhaps by convention the metric for workload `x` should be named `x_...`; I don't think we need to quite bake that into the framework (that would lead to lots of annoying wrapping), but we should start it off that way.
pkg/workload/tpcc/worker.go, line 96 at r1 (raw file):
allTxs[i].successCounter = prometheusMetrics.NewCounter(
	prometheus.CounterOpts{
		Name: fmt.Sprintf("op_%s_success", tx.name),
I don't think you need a success counter, since you already have a success durations histogram. Histograms contain a count.
pkg/workload/tpcc/worker.go, line 107 at r1 (raw file):
)
allTxs[i].durationHistogram = prometheusMetrics.NewHistogram(
	prometheus.HistogramOpts{
I think you want to populate `HistogramOpts.Buckets` with some invocation of `ExponentialBuckets`, or even more hand-crafted buckets. As is, this is a very low-res histogram and as such not very useful.
I think we also want built-in parity between the existing histograms and the prometheus ones, i.e. when you define one of the ones we have now (`workload/histogram`) you also get a prometheus one (and have to pass a suitable registry in). We can't "just" rip out the existing histograms because they're windowed (i.e. they forget their history, so that we can produce printouts with "recent latency distributions"); they're serving different purposes. But we should unify them. I'm less bullish on making the prometheus histogram buckets match exactly the bucket boundaries of the hdrhistogram. I originally chose those and I don't think there's any reason to assume that I did a particularly good job. Maybe we can lean on @joshimhoff and @nvanbenschoten (with their SRE and Perf hats on, respectively) to tell us what good bucket boundaries are. We don't have many histograms here, so we can afford a lot of buckets, but there's also no particular reason to go overboard. We can start with https://play.golang.org/p/TEUs4Vw7Z5e (start=0.5ms, factor=1.5, count=25), but it seems a little low-res in the hundreds-of-ms range; it does feel like the hdrhistogram approach is better since it scales more naturally, but either way that bikeshed can also wait until we've landed the first cut.
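For reference, a minimal sketch of the bucket layout suggested above (start=0.5ms, factor=1.5, count=25), using `prometheus.ExponentialBuckets`:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// start=0.5ms, factor=1.5, count=25; buckets are in seconds, per the
	// prometheus convention, and top out around 8.4s.
	buckets := prometheus.ExponentialBuckets(0.0005, 1.5, 25)
	fmt.Println(buckets)
}
```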
I think my impl approach would be to make `histogram.NewRegistry` also take a `prometheus.Registry` and to populate it when initializing the instance of the workload (not in the `init` function). As an exception, we'll create histograms as a byproduct of `hist.Get("newOrder")`, and `hist.Get("foo").RecordValue` will tee to both histograms. We'll expose the `prometheus.Registry` from the `hist` handle so we can add any additional metrics we care about (like the error count you have here, that's a good one).
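A sketch of what the proposed tee could look like; the `teeHistogram` type, its names, the hdrhistogram import path, and the min/max bounds are all illustrative assumptions, not the actual implementation:

```go
package main

import (
	"time"

	"github.com/codahale/hdrhistogram"
	"github.com/prometheus/client_golang/prometheus"
)

// teeHistogram tees every recorded latency into both a windowed hdrhistogram
// (for the periodic text printouts) and a cumulative prometheus histogram
// (for scraping).
type teeHistogram struct {
	hdr  *hdrhistogram.Histogram
	prom prometheus.Histogram
}

func newTeeHistogram(reg *prometheus.Registry, name string) *teeHistogram {
	h := &teeHistogram{
		// 100µs..100s at 1 significant figure, tracked in nanoseconds.
		hdr: hdrhistogram.New(int64(100*time.Microsecond), int64(100*time.Second), 1),
		prom: prometheus.NewHistogram(prometheus.HistogramOpts{
			Name:    name,
			Buckets: prometheus.ExponentialBuckets(0.0005, 1.5, 25),
		}),
	}
	reg.MustRegister(h.prom)
	return h
}

func (h *teeHistogram) Record(elapsed time.Duration) {
	_ = h.hdr.RecordValue(elapsed.Nanoseconds()) // hdr side buckets in ns
	h.prom.Observe(elapsed.Seconds())            // prometheus side in seconds
}

func main() {
	reg := prometheus.NewRegistry()
	h := newTeeHistogram(reg, "workload_tpcc_new_order_success_duration_seconds")
	h.Record(42 * time.Millisecond)
}
```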
pkg/workload/tpcc/worker.go, line 108 at r1 (raw file):
allTxs[i].durationHistogram = prometheusMetrics.NewHistogram(
	prometheus.HistogramOpts{
		Name: fmt.Sprintf("op_%s_durationMS", tx.name),
I read through https://prometheus.io/docs/practices/naming/ and https://prometheus.io/docs/instrumenting/writing_exporters/#target-labels-not-static-scraped-labels and I think this should be called `workload_tpcc_%s_success_seconds_bucket`, with the `newOrder` etc. camel case lowercased or snake-cased. This will also, in the common case, avoid a need to disambiguate metrics from multiple workloads when collected on a single prometheus instance (except when you're running multiple instances of TPCC).
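A quick sketch of the suggested naming, with a hypothetical `snakeCase` helper and the `_duration_seconds` suffix the thread later converges on:

```go
package main

import (
	"fmt"
	"strings"
)

// snakeCase is a hypothetical helper: "newOrder" -> "new_order".
func snakeCase(s string) string {
	var b strings.Builder
	for i, r := range s {
		if r >= 'A' && r <= 'Z' {
			if i > 0 {
				b.WriteByte('_')
			}
			r += 'a' - 'A'
		}
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	name := fmt.Sprintf("workload_tpcc_%s_success_duration_seconds", snakeCase("newOrder"))
	fmt.Println(name) // workload_tpcc_new_order_success_duration_seconds
}
```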
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @otan)
pkg/workload/tpcc/worker.go, line 52 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I don't think you need that, since you have
... a success histogram already. See other comment below.
pkg/workload/tpcc/worker.go, line 107 at r1 (raw file):
I think my impl approach would be to make `histogram.NewRegistry` also take…
I meant that `histogram.NewRegistry` would also make a prometheus one.
We need singletons, as we can only have one unique name for a metric per prometheus registry.
I've added the histograms by default and removed the hooks, making workload contain the prometheus registry. PTAL.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)
pkg/workload/tpcc/worker.go, line 94 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is too much singleton - `allTxs` is shared because it's all stateless, but now we're essentially adding state to it. We shouldn't do that; it is perfectly reasonable to want to invoke multiple copies of a workload in a single test and get dedicated stats for each. Maybe this doesn't work now for other reasons (we have a unit test for this workload, so probably it does or at least almost should), but we should move things in the right direction. I think this would naturally fall out of my other comment below about unifying both versions of histograms we have here now. It's also worth thinking about how we'll distinguish metrics from different workloads all scraped by the same prometheus instance.
`kv` and `bank` might both end up with a metric named `op_success`; how do they get disambiguated? Perhaps by convention the metric for workload `x` should be named `x_...`; I don't think we need to quite bake that into the framework (that would lead to lots of annoying wrapping), but we should start it off that way.
You can only initialize one counter with that name, which is why I've done it this way.
I'll apply the naming changes.
pkg/workload/tpcc/worker.go, line 96 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I don't think you need a success counter, since you already have a success durations histogram. Histograms contain a count.
I like having it explicit (more obvious to search and use). The histogram byproduct is nice, but it's good to have a clear "success" and "error" signal.
In other words, how strongly do you feel about this?
pkg/workload/tpcc/worker.go, line 108 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I read through https://prometheus.io/docs/practices/naming/ and https://prometheus.io/docs/instrumenting/writing_exporters/#target-labels-not-static-scraped-labels and I think this should be called
`workload_tpcc_%s_success_seconds_bucket`, with the `newOrder` etc. camel case lowercased or snake-cased. This will also, in the common case, avoid a need to disambiguate metrics from multiple workloads when collected on a single prometheus instance (except when you're running multiple instances of TPCC).
The `_bucket` suffix is already added automatically when you run it, but I've renamed the metric to `_duration_milliseconds`:
I prefer `ms` as the unit of measurement, as most operations are on that scale.
Force-pushed 659f9c9 to 8a4a350
This commit adds prometheus metrics to the workload binary, exposing them on port 2112 by default (configurable by CLI flag). We also add some sample metrics to test the machinery:
* all histogram metrics are automatically added
* success and failure rate of tpc-c operations
Release note: None
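A minimal sketch of how the registry might be served per this commit message; only the 2112 default comes from the commit, while the `--prometheus-port` flag name and the promhttp wiring are assumptions:

```go
package main

import (
	"flag"
	"fmt"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Flag name is a placeholder; the commit only states that the port is
	// configurable by CLI flag and defaults to 2112.
	port := flag.Int("prometheus-port", 2112, "port to expose prometheus metrics on")
	flag.Parse()

	reg := prometheus.NewRegistry()
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	// The real binary serves this alongside the workload; here it just blocks.
	_ = http.ListenAndServe(fmt.Sprintf(":%d", *port), nil)
}
```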
pkg/workload/tpcc/worker.go, line 108 at r1 (raw file): Previously, otan (Oliver Tan) wrote…
Hmm, nvm, I'll do seconds.
In case it's helpful, here's how the metrics currently look:
Thanks! That looks a lot better. I added a commit that does a bit more work to streamline things, let me know if I've accidentally regressed on anything. The singleton is gone and the way TPCC adds extra prometheus metrics is, for my taste, now idiomatic.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker and @otan)
pkg/workload/tpcc/worker.go, line 94 at r1 (raw file):
You can only initialize one counter with that name, which is why I've done it this way.
Sharing the registry across workloads using a singleton is part of the problem. I added a commit that un-singletonizes everything. WDYT?
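A sketch of the un-singletonized shape being described, with hypothetical type and field names: each workload instance owns its own `prometheus.Registry`, so two instances can register the same metric name without colliding.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

type tpccMetrics struct {
	reg             *prometheus.Registry
	deliverySuccess prometheus.Counter
}

func newTPCCMetrics() *tpccMetrics {
	reg := prometheus.NewRegistry()
	m := &tpccMetrics{
		reg: reg,
		deliverySuccess: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "workload_tpcc_delivery_success_total",
			Help: "The total number of successful delivery transactions.",
		}),
	}
	reg.MustRegister(m.deliverySuccess)
	return m
}

func main() {
	// Two independent instances register the same metric name without
	// panicking because each has its own registry.
	_ = newTPCCMetrics()
	_ = newTPCCMetrics()
}
```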
pkg/workload/tpcc/worker.go, line 96 at r1 (raw file):
Previously, otan (Oliver Tan) wrote…
I like having it explicit (more obvious to search and use). The histogram byproduct is nice, but it's good to have a clear "success" and "error" signal.
In other words, how strongly do you feel about this?
I'm not going to push back more since it's not consequential, but I can't really say that I see your point - look, they're even right next to each other. To me this duplication adds confusion, if anything. The only value is in the HELP text, and we could give the histograms a better help text; they're "helpless" right now because we haven't gone through the trouble of improving the `w.hists.Get()` thing, but we could (not saying we should at this point). I guess I feel strongly but I'm willing to drop it regardless 😄
# HELP workload_tpcc_delivery_success_duration_seconds
# TYPE workload_tpcc_delivery_success_duration_seconds histogram
workload_tpcc_delivery_success_duration_seconds_bucket{le="0.005"} 0
[...]
workload_tpcc_delivery_success_duration_seconds_sum 0.059154021
workload_tpcc_delivery_success_duration_seconds_count 2
# HELP workload_tpcc_delivery_success_total The total number of successful delivery transactions.
# TYPE workload_tpcc_delivery_success_total counter
workload_tpcc_delivery_success_total 2
Hrm, I don't seem to be able to push to your branch (did you perhaps disable that? I generally find it useful). Here are my commits:
66313: workload: add prometheus metrics r=otan a=tbg
- workload/tpcc: expose tpc-c ops and duration as prometheus metrics
- workload: un-singletonize metrics
- histogram: adjust prometheus histogram bucket sizes
Opening this here since I wasn't able to push to #66224 directly.
Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>