roachtest: store TPC-C prometheus metrics as snapshot in artifacts #66352

Closed

Conversation

@otan (Contributor) commented Jun 11, 2021

This commit adds a prometheus scraper onto the machines that run TPC-C
workloads and scrapes their metrics.

At the end of the run, we take a snapshot of the metrics of the workload
and store it as an artifact, which can be used later.

Release note: None
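
For context, a plausible mechanism for the "snapshot" step is Prometheus's own TSDB snapshot endpoint (POST /api/v1/admin/tsdb/snapshot, which requires the server to run with --web.enable-admin-api). A minimal sketch of that step, assuming that is the mechanism; the address is a placeholder, and the step that copies the snapshot directory into the test's artifacts is left out:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// takePrometheusSnapshot asks a Prometheus server to snapshot its TSDB and
// returns the snapshot's directory name. The server must be running with
// --web.enable-admin-api for this endpoint to be available; the snapshot is
// written under <data-dir>/snapshots/<name>, from where a test harness could
// copy it into the artifacts directory.
func takePrometheusSnapshot(promAddr string) (string, error) {
	resp, err := http.Post(
		fmt.Sprintf("http://%s/api/v1/admin/tsdb/snapshot", promAddr),
		"application/json", nil,
	)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var result struct {
		Status string `json:"status"`
		Data   struct {
			Name string `json:"name"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return "", err
	}
	if result.Status != "success" {
		return "", fmt.Errorf("snapshot failed: status=%s", result.Status)
	}
	return result.Data.Name, nil
}

func main() {
	// Placeholder address; roachtest would point this at the node
	// running the scraper.
	name, err := takePrometheusSnapshot("localhost:9090")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("snapshot:", name)
}
```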

@otan otan requested review from tbg and a team June 11, 2021 08:33

@otan (Contributor, Author) commented Jun 11, 2021

This is kind of draft-y (some rough edges!), but it's ready for a look. Advice on making this cleaner would be good, but feel free to add your own commits as well.

@tbg (Member) commented Jun 11, 2021

Nice! I think this needs a little bit of discussion about goals, though. The current state of affairs is that we have basically decided that workloads that run as part of CockroachCloud (on release qualification clusters, etc.) should be monitored via prometheus (because that's how CC monitors things). We also feel that we should lean more on this standard architecture in roachtest, but we haven't made a plan for what this means. I think there are a few things we need to figure out short-term:

  1. Are we primarily looking at adding a prometheus deployment to roachtest as an optional thing you can turn on during debugging? Or are we actually trying to have each roachtest run also run prometheus?
  2. If the latter, we need to figure out a few things:
    • Can we get away with downloading prometheus every time? I'd say no - we need to bake it into the images, with dev-inf's help (tests that download random stuff from the internet via apt-get and the like are a problem we already have today, so it's likely time to double down on fixing it).
    • How does this interact with the local stressing of roachtests? I think poorly, so we need to be able to turn it off.
    • Do we run one prometheus instance per roachtest invocation (i.e. shared across all tests) or one per test? The latter is pretty high overhead, but makes it easier to disambiguate the artifacts. With the former, we need to update the scraping configs regularly to add the current test name to each scrape target (see the sketch at the end of this comment).
  3. If (still) the latter, what are we actually trying to do with prometheus and the data? We could:
    • replace stats.json and drive our benchmark pipeline (roachperf) from prometheus data instead. But then we need a plan for doing that, and in particular a prometheus data directory isn't what we want. We want a proper export of the data and tooling around it.
    • try to interact with prometheus from the tests directly - this has dependencies on some of the bullets above (as tests would then get a hard dependency on prometheus), and there is the question of what this would look like.

It will likely take us a little bit to figure this out. I would definitely want to merge (a polished version of this) as experimental. It'll be really cool to be able to run a test and look at the workload metrics via prometheus in an opt-in fashion, and in particular we can prototype the aggregations that the release engineering team can then deploy on CockroachCloud release qualification clusters when they get around to deploying workloads there.
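
On the shared-instance option in (2): one way to tag each scrape target with the current test would be to render a per-test scrape_config fragment whose static targets carry a test-name label. A rough sketch; the label name (roachtest), the target address, and the workload's metrics port are all illustrative assumptions, not what the PR actually does:

```go
package main

import (
	"fmt"
	"strings"
)

// scrapeConfigForTest renders a Prometheus scrape_config fragment whose
// static targets all carry the current test's name as a label, so that one
// shared Prometheus instance can still attribute samples to individual
// tests. The label name "roachtest" is illustrative only.
func scrapeConfigForTest(testName string, workloadAddrs []string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "  - job_name: %q\n", testName)
	b.WriteString("    static_configs:\n")
	b.WriteString("      - targets:\n")
	for _, addr := range workloadAddrs {
		fmt.Fprintf(&b, "          - %q\n", addr)
	}
	b.WriteString("        labels:\n")
	fmt.Fprintf(&b, "          roachtest: %q\n", testName)
	return b.String()
}

func main() {
	// Hypothetical test name and workload address.
	fmt.Print(scrapeConfigForTest("tpccbench/nodes=3", []string{"10.0.0.4:2112"}))
}
```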

@otan otan marked this pull request as draft June 11, 2021 12:16
@erikgrinaker (Contributor) commented:

> replace stats.json and drive our benchmark pipeline (roachperf) from prometheus data instead. But then we need a plan for doing that, and in particular a prometheus data directory isn't what we want. We want a proper export of the data and tooling around it.

This is related to #65193, where we'd like to inspect the final intent counts after the tests complete (but not necessarily fail the tests over this).

It seems to me like it'd be more useful to just archive the nodes' Prometheus timeseries for all runs in a queryable form. First, this would allow us to use a bunch of standard tooling to interact with it while piggybacking on existing observability efforts. Second, and I think more interestingly, it would give us tons of historical data that we could go back and look at, e.g. if we're investigating a bug or performance regression. For example, it would have been super useful if we had the intent data for the last year and could see when the intent leaks started, since we'd have collected the data before we even knew we had a problem.
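
To make "queryable form" concrete: once an archived snapshot is served again by a Prometheus instance, the standard HTTP range-query API can pull any metric over a historical window. A sketch under those assumptions; the intentcount metric name and the server address are placeholders:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
	"time"
)

// queryIntentCount pulls a metric over a historical window via Prometheus's
// standard range-query API. The same call works against an instance
// re-serving an archived snapshot, which is what makes keeping the raw
// timeseries around so useful.
func queryIntentCount(promAddr string, start, end time.Time) (string, error) {
	q := url.Values{}
	q.Set("query", "sum(intentcount)") // assumed metric name
	q.Set("start", fmt.Sprint(start.Unix()))
	q.Set("end", fmt.Sprint(end.Unix()))
	q.Set("step", "1h")

	// GET /api/v1/query_range returns a JSON matrix of (timestamp, value)
	// pairs that Grafana, promtool, and similar tooling can also consume.
	resp, err := http.Get(fmt.Sprintf(
		"http://%s/api/v1/query_range?%s", promAddr, q.Encode()))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	// Query the last year of data from a (placeholder) archive instance.
	now := time.Now()
	body, err := queryIntentCount("localhost:9090", now.AddDate(-1, 0, 0), now)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(body)
}
```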

@otan (Contributor, Author) commented Jun 21, 2021

As discussed in the meeting, I'm making this optional, using #66657 (I forgot I'd put that one up).

@otan otan closed this Jun 21, 2021