roachtest: store TPC-C prometheus metrics as snapshot in artifacts #66352

Closed

Conversation

@otan (Contributor) commented Jun 11, 2021

This commit adds a prometheus scraper onto the machines that run TPC-C
workloads and scrapes their metrics.

At the end of the run, we take a snapshot of the metrics of the workload
and store it as an artifact, which can be used later.

Release note: None
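
For context, a plausible mechanism for the "snapshot" step is Prometheus's own TSDB snapshot endpoint (POST /api/v1/admin/tsdb/snapshot, which requires the server to run with --web.enable-admin-api). A minimal sketch of that step, assuming that is the mechanism; the address is a placeholder, and the step that copies the snapshot directory into the test's artifacts is left out:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// takePrometheusSnapshot asks a Prometheus server to snapshot its TSDB and
// returns the snapshot's directory name. The server must be running with
// --web.enable-admin-api for this endpoint to be available; the snapshot is
// written under <data-dir>/snapshots/<name>, from where a test harness could
// copy it into the artifacts directory.
func takePrometheusSnapshot(promAddr string) (string, error) {
	resp, err := http.Post(
		fmt.Sprintf("http://%s/api/v1/admin/tsdb/snapshot", promAddr),
		"application/json", nil,
	)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var result struct {
		Status string `json:"status"`
		Data   struct {
			Name string `json:"name"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return "", err
	}
	if result.Status != "success" {
		return "", fmt.Errorf("snapshot failed: status=%s", result.Status)
	}
	return result.Data.Name, nil
}

func main() {
	// Placeholder address; roachtest would point this at the node
	// running the scraper.
	name, err := takePrometheusSnapshot("localhost:9090")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("snapshot:", name)
}
```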

@otan otan requested review from tbg and a team June 11, 2021 08:33

@otan (Contributor, Author) commented Jun 11, 2021

This is kind of draft-y (some rough edges!), but it's ready for a look. Advice on making this cleaner would be good, but feel free to add your own commits as well.

@tbg (Member) commented Jun 11, 2021

Nice! I think this needs a little bit of discussion about goals, though. The current state of affairs is that we have basically decided that workloads that run as part of CockroachCloud (on release qualification clusters, etc.) should be monitored via prometheus (because that's how CC monitors things). We also feel that we should lean more on this standard architecture in roachtest, but we haven't made a plan for what this means. I think there are a few things we need to figure out short-term:

  1. Are we primarily looking at adding a prometheus deployment to roachtest as an optional thing you can turn on during debugging? Or are we actually trying to have each roachtest run also run prometheus?
  2. If the latter, we need to figure out a few things:
    • Can we get away with downloading prometheus every time? I'd say no - we need to bake it into the images, with dev-inf's help (tests that download random stuff from the internet via apt-get and the like are a problem we already have today, so it's likely time to double down on fixing it).
    • How does this interact with the local stressing of roachtests? I think poorly, so we need to be able to turn it off.
    • Do we run one prometheus instance per roachtest invocation (i.e. shared across all tests) or one per test? The latter is pretty high overhead, but makes it easier to disambiguate the artifacts. With the former, we need to update the scraping configs regularly to add the current test name to each scrape target (see the sketch at the end of this comment).
  3. If (still) the latter, what are we actually trying to do with prometheus and the data? We could:
    • replace stats.json and drive our benchmark pipeline (roachperf) from prometheus data instead. But then we need a plan for doing that, and in particular a prometheus data directory isn't what we want. We want a proper export of the data and tooling around it.
    • try to interact with prometheus from the tests directly - this has dependencies on some of the bullets above (as tests would then get a hard dependency on prometheus), and there is the question of what this would look like.

It will likely take us a little bit to figure this out. I would definitely want to merge (a polished version of this) as experimental. It'll be really cool to be able to run a test and look at the workload metrics via prometheus in an opt-in fashion, and in particular we can prototype the aggregations that the release engineering team can then deploy on CockroachCloud release qualification clusters when they get around to deploying workloads there.
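
On the shared-instance option in (2): one way to tag each scrape target with the current test would be to render a per-test scrape_config fragment whose static targets carry a test-name label. A rough sketch; the label name (roachtest), the target address, and the workload's metrics port are all illustrative assumptions, not what the PR actually does:

```go
package main

import (
	"fmt"
	"strings"
)

// scrapeConfigForTest renders a Prometheus scrape_config fragment whose
// static targets all carry the current test's name as a label, so that one
// shared Prometheus instance can still attribute samples to individual
// tests. The label name "roachtest" is illustrative only.
func scrapeConfigForTest(testName string, workloadAddrs []string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "  - job_name: %q\n", testName)
	b.WriteString("    static_configs:\n")
	b.WriteString("      - targets:\n")
	for _, addr := range workloadAddrs {
		fmt.Fprintf(&b, "          - %q\n", addr)
	}
	b.WriteString("        labels:\n")
	fmt.Fprintf(&b, "          roachtest: %q\n", testName)
	return b.String()
}

func main() {
	// Hypothetical test name and workload address.
	fmt.Print(scrapeConfigForTest("tpccbench/nodes=3", []string{"10.0.0.4:2112"}))
}
```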

@otan otan marked this pull request as draft June 11, 2021 12:16
@erikgrinaker (Contributor) commented:

> replace stats.json and drive our benchmark pipeline (roachperf) from prometheus data instead. But then we need a plan for doing that, and in particular a prometheus data directory isn't what we want. We want a proper export of the data and tooling around it.

This is related to #65193, where we'd like to inspect the final intent counts after the tests complete (but not necessarily fail the tests over this).

It seems to me like it'd be more useful to just archive the nodes' Prometheus timeseries for all runs in a queryable form. First, this would allow us to use a bunch of standard tooling to interact with it while piggybacking on existing observability efforts. Second, and I think more interestingly, it would give us tons of historical data that we could go back and look at, e.g. if we're investigating a bug or performance regression. For example, it would have been super useful if we had the intent data for the last year and could see when the intent leaks started, since we'd have collected the data before we even knew we had a problem.
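
To make "queryable form" concrete: once an archived snapshot is served again by a Prometheus instance, the standard HTTP range-query API can pull any metric over a historical window. A sketch under those assumptions; the intentcount metric name and the server address are placeholders:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
	"time"
)

// queryIntentCount pulls a metric over a historical window via Prometheus's
// standard range-query API. The same call works against an instance
// re-serving an archived snapshot, which is what makes keeping the raw
// timeseries around so useful.
func queryIntentCount(promAddr string, start, end time.Time) (string, error) {
	q := url.Values{}
	q.Set("query", "sum(intentcount)") // assumed metric name
	q.Set("start", fmt.Sprint(start.Unix()))
	q.Set("end", fmt.Sprint(end.Unix()))
	q.Set("step", "1h")

	// GET /api/v1/query_range returns a JSON matrix of (timestamp, value)
	// pairs that Grafana, promtool, and similar tooling can also consume.
	resp, err := http.Get(fmt.Sprintf(
		"http://%s/api/v1/query_range?%s", promAddr, q.Encode()))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	// Query the last year of data from a (placeholder) archive instance.
	now := time.Now()
	body, err := queryIntentCount("localhost:9090", now.AddDate(-1, 0, 0), now)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(body)
}
```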

@otan (Contributor, Author) commented Jun 21, 2021

As discussed in the meeting, I'm making this optional, using #66657 (I forgot I'd put that one up).

@otan otan closed this Jun 21, 2021