
roachtest: output and investigate intent stats after TPCC tests #65193

Closed
erikgrinaker opened this issue May 14, 2021 · 6 comments
Labels: A-testing (Testing tools and infrastructure) · C-investigation (Further steps needed to qualify. C-label will change.)

erikgrinaker commented May 14, 2021

As part of the intent buildup investigation in #60585 we should output the intent counts as reported by MVCCStats after TPCC workloads complete, and make sure they are generally 0 in all expected cases (e.g. barring node restarts).

Epic: CRDB-2554

Jira issue: CRDB-7476

aliher1911 commented
Current state of things:
We don't know the state of intents after our TPCC or other roachtest runs.
We could get some signal out of this metric if we start monitoring it.
Getting intent counts is trivial and could be done in a handful of lines of code inside a roachtest. For the headroom test this number seems to be consistently zero, and consistently non-zero for TPCC benchmarks that rely on a system restart after loading the initial data.
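As a minimal sketch of what "a handful of lines" might look like: the intent count from MVCCStats can be summed across ranges via the crdb_internal catalog, and a small helper can decide whether a run is clean. The query shape and the helper name `checkIntentCount` are assumptions for illustration, not existing roachtest code.

```go
package main

import "fmt"

// intentCountQuery sums intent_count from the MVCC stats of every range.
// crdb_internal.range_stats returns a JSON blob that includes intent_count;
// treat the exact query shape as an assumption about the catalog's schema.
const intentCountQuery = `
SELECT coalesce(sum((crdb_internal.range_stats(start_key)->>'intent_count')::INT), 0)
  FROM crdb_internal.ranges_no_leases`

// checkIntentCount is a hypothetical helper: flag leftover intents only when
// no node restarts occurred, since restarts are expected to leave intents.
func checkIntentCount(intents int64, nodesRestarted bool) error {
	if intents > 0 && !nodesRestarted {
		return fmt.Errorf("found %d intents after workload; expected 0", intents)
	}
	return nil
}

func main() {
	fmt.Println(checkIntentCount(0, false) == nil)  // true: clean run passes
	fmt.Println(checkIntentCount(42, true) == nil)  // true: restarts tolerated
	fmt.Println(checkIntentCount(42, false) != nil) // true: leak is flagged
}
```

A roachtest would run `intentCountQuery` against one node after the workload completes and feed the result to the check.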

What to do with this number?
We could fail a test if the number is non-zero, since we don't expect tests to produce any abandoned intents when nodes don't suffer failures. But that seems a bit harsh: a non-zero count is not strictly a failure, and the system has mechanisms to clean up those intents. We also don't know how often this happens, or whether it would start breaking builds.
The better approach would be to monitor those numbers over time to see how tests behave, but we don't have any facility to collect them. The existing perf histograms only capture timing metrics, not final values, and using them for this would be an abuse of the mechanism.

erikgrinaker commented May 31, 2021

If we've had a look at some TPCC tests and found intent counts to be 0 in all expected cases, then I don't think there's much more we need to do here. Most of the intent leaks we've found have been along sad paths (typically client disconnects and errors), and I think it's more fruitful to actively provoke intent leaks with targeted integration tests such as TestReliableIntentCleanup -- these also give a much better repro signal.

aliher1911 commented Jun 10, 2021

I actually tried to see what we can get out of this. What I propose is that we collect the "health" metrics we want to observe and push them together with the stats.json we already push for performance histograms. We could add that to any test, and it would give us a historical perspective, and maybe early signs of failure, for these non-functional metrics.
We can easily produce a JSON file with the values and push it to the artifacts on the workload node.
Then we can tell TeamCity to push it to the same location as the perf metrics.
Something like: #66315
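The JSON artifact above could be as simple as marshaling a small struct. The `healthMetrics` schema and field names below are hypothetical, just to show the shape of what would sit next to stats.json:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// healthMetrics is a hypothetical schema for the per-run "health" values
// discussed above, written alongside stats.json in the test artifacts.
type healthMetrics struct {
	Test        string `json:"test"`
	IntentCount int64  `json:"intent_count"`
}

// marshalHealth renders the metrics as a single JSON object.
func marshalHealth(m healthMetrics) (string, error) {
	b, err := json.Marshal(m)
	return string(b), err
}

func main() {
	out, _ := marshalHealth(healthMetrics{Test: "tpcc/headroom", IntentCount: 0})
	fmt.Println(out) // {"test":"tpcc/headroom","intent_count":0}
}
```

The roachtest would write this string to a file in the artifacts directory on the workload node, and TeamCity would pick it up like any other artifact.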

erikgrinaker commented
Right. We're currently considering what to do about roachperf and roachtest data in the test-eng team, since the current system isn't really covering our needs.

Not sure where we'll end up, but we're moving in the direction of exposing Prometheus metrics about workloads and tests, which we can pull into some data platform once we figure out what system we want to use for processing and visualizing it (see e.g. #66313 which added this to workload just today).

For these intent counts, I think it'd make sense to simply use the Prometheus intent metrics we already expose, since that's the direction we're heading in, rather than adding another data channel for this. But I'd be curious to hear @cockroachdb/test-eng's thoughts.
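Reading the existing Prometheus metrics would amount to scraping the node's metrics endpoint and pulling out one gauge. The sketch below parses the Prometheus text exposition format; the metric name "intentcount" and the endpoint path are assumptions about what the node exposes:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// scrapeGauge extracts a single gauge value from a Prometheus text-format
// scrape body (e.g. from the node's /_status/vars endpoint). It returns the
// first sample whose metric name matches exactly, and false if none is found.
func scrapeGauge(body, name string) (float64, bool) {
	for _, line := range strings.Split(body, "\n") {
		fields := strings.Fields(line)
		// Data lines have the shape "metric_name value"; comment lines
		// start with "#" and never match an exact name comparison.
		if len(fields) == 2 && fields[0] == name {
			if v, err := strconv.ParseFloat(fields[1], 64); err == nil {
				return v, true
			}
		}
	}
	return 0, false
}

func main() {
	sample := "# TYPE intentcount gauge\nintentcount 3\n"
	v, ok := scrapeGauge(sample, "intentcount")
	fmt.Println(v, ok) // 3 true
}
```

A test harness would fetch the endpoint after the workload, scrape the gauge, and record or assert on it, with no extra data channel needed.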

aliher1911 commented
If we're going to build something dedicated for this, it makes no sense to add it here, and printing it to the logs is pretty much useless. Should we park this until there's clarity on metrics?

erikgrinaker commented
Yeah, let's wait and see.
