roachtest: store TPC-C prometheus metrics as snapshot in artifacts #66352
This is kind of draft-y (some rough edges!), but ready for a look. Advice on making this clean would be good, but feel free to add your own commits as well.
Nice! I think this needs a little bit of discussion about goals though. The current state of affairs is that we have basically decided that workloads that run as part of CockroachCloud (on release qualification clusters, etc.) should be monitored via Prometheus, because that's how CC monitors things. We also feel that we should lean more on this standard architecture in roachtest, but we haven't made a plan for what this means. I think there are a few things we need to figure out short-term:
It will likely take us a little bit to figure this out. I would definitely want to merge (a polished version of this) as experimental. It'll be really cool to be able to run a test and look at the workload metrics via Prometheus in an opt-in fashion, and in particular we can prototype the aggregations that the release engineering team can then deploy on CockroachCloud release qualification clusters when they get around to deploying workloads there.
This commit adds a prometheus scraper onto the machines that run TPC-C workloads and scrapes their metrics.
At the end of the run, we take a snapshot of the metrics of the workload and store it as an artifact, which can be used later.
Release note: None
This is related to #65193, where we'd like to inspect the final intent counts after the tests complete (but not necessarily fail the tests over this). It seems to me like it'd be more useful to just archive the nodes' Prometheus timeseries for all runs in a queryable form. Firstly, this would allow us to use a bunch of standard tooling to interact with it while piggybacking on existing observability efforts, and I think more interestingly, this would give us tons of historical data that we could go back and look at e.g. if we're investigating a bug or performance regression. For example, it would have been super-useful if we had the intent data for the last year and could see when the intent leaks started, since we'd collected the data before we even knew we had a problem.
As discussed in the meeting, making this optional, using #66657 (forgot I put this one up).