
roachtest: output and investigate intent stats after TPCC tests #65193

Closed
erikgrinaker opened this issue May 14, 2021 · 6 comments
Labels: A-testing (Testing tools and infrastructure) · C-investigation (Further steps needed to qualify. C-label will change.)

erikgrinaker commented May 14, 2021

As part of the intent buildup investigation in #60585 we should output the intent counts as reported by MVCCStats after TPCC workloads complete, and make sure they are generally 0 in all expected cases (e.g. barring node restarts).

Epic: CRDB-2554

Jira issue: CRDB-7476

aliher1911 commented
Current state of things:
We don't know the state of intents after our TPCC or other roachtest runs.
We could get some signal out of this metric if we start monitoring it.
Getting intent counts is trivial and could be done in a handful of lines of code inside a roachtest. For the headroom test this number seems to be consistently zero, and consistently non-zero for TPCC benchmarks that rely on a system restart after loading the initial data.
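As a minimal sketch of what "a handful of lines" might look like: the intent count from MVCCStats can be summed across ranges via the crdb_internal catalog, and a small helper can decide whether a run is clean. The query shape and the helper name `checkIntentCount` are assumptions for illustration, not existing roachtest code.

```go
package main

import "fmt"

// intentCountQuery sums intent_count from the MVCC stats of every range.
// crdb_internal.range_stats returns a JSON blob that includes intent_count;
// treat the exact query shape as an assumption about the catalog's schema.
const intentCountQuery = `
SELECT coalesce(sum((crdb_internal.range_stats(start_key)->>'intent_count')::INT), 0)
  FROM crdb_internal.ranges_no_leases`

// checkIntentCount is a hypothetical helper: flag leftover intents only when
// no node restarts occurred, since restarts are expected to leave intents.
func checkIntentCount(intents int64, nodesRestarted bool) error {
	if intents > 0 && !nodesRestarted {
		return fmt.Errorf("found %d intents after workload; expected 0", intents)
	}
	return nil
}

func main() {
	fmt.Println(checkIntentCount(0, false) == nil)  // true: clean run passes
	fmt.Println(checkIntentCount(42, true) == nil)  // true: restarts tolerated
	fmt.Println(checkIntentCount(42, false) != nil) // true: leak is flagged
}
```

A roachtest would run `intentCountQuery` against one node after the workload completes and feed the result to the check.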

What to do with this number?
We could fail a test if the number is non-zero, since we don't expect tests to produce any abandoned intents when nodes don't suffer failures. But that seems a bit harsh: a non-zero count is not strictly a failure, and the system has mechanisms to clean up those intents. We also don't know how often this happens, or whether it would start breaking builds.
The better approach would be to monitor those numbers over time to see how tests behave, but we don't have any facility to collect them. The existing perf histograms only capture timing metrics, not final values, and using them for this would be an abuse of the mechanism.

erikgrinaker commented May 31, 2021

If we've had a look at some TPCC tests and found intent counts to be 0 in all expected cases, then I don't think there's much more we need to do here. Most of the intent leaks we've found have been along sad paths (typically client disconnects and errors), and I think it's more fruitful to actively provoke intent leaks with targeted integration tests such as TestReliableIntentCleanup -- these also give a much better repro signal.

aliher1911 commented Jun 10, 2021

I actually tried to see what we can get out of this. What I propose is that we collect the "health" metrics we want to observe and push them together with the stats.json we already push for performance histograms. We could add that to any test, and it would give us a historical perspective, and maybe early signs of failure, for these non-functional metrics.
We can easily produce a JSON file with the values and push it to the artifacts on the workload node.
Then we can tell TeamCity to push it to the same location as the perf metrics.
Something like: #66315
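The JSON artifact above could be as simple as marshaling a small struct. The `healthMetrics` schema and field names below are hypothetical, just to show the shape of what would sit next to stats.json:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// healthMetrics is a hypothetical schema for the per-run "health" values
// discussed above, written alongside stats.json in the test artifacts.
type healthMetrics struct {
	Test        string `json:"test"`
	IntentCount int64  `json:"intent_count"`
}

// marshalHealth renders the metrics as a single JSON object.
func marshalHealth(m healthMetrics) (string, error) {
	b, err := json.Marshal(m)
	return string(b), err
}

func main() {
	out, _ := marshalHealth(healthMetrics{Test: "tpcc/headroom", IntentCount: 0})
	fmt.Println(out) // {"test":"tpcc/headroom","intent_count":0}
}
```

The roachtest would write this string to a file in the artifacts directory on the workload node, and TeamCity would pick it up like any other artifact.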

erikgrinaker commented
Right. We're currently considering what to do about roachperf and roachtest data in the test-eng team, since the current system isn't really covering our needs.

Not sure where we'll end up, but we're moving in the direction of exposing Prometheus metrics about workloads and tests, which we can pull into some data platform once we figure out what system we want to use for processing and visualizing it (see e.g. #66313 which added this to workload just today).

For these intent counts, I think it'd make sense to simply use the Prometheus intent metrics we already expose, since that's the direction we're heading in, rather than adding another data channel for this. But I'd be curious to hear @cockroachdb/test-eng's thoughts.
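Reading the existing Prometheus metrics would amount to scraping the node's metrics endpoint and pulling out one gauge. The sketch below parses the Prometheus text exposition format; the metric name "intentcount" and the endpoint path are assumptions about what the node exposes:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// scrapeGauge extracts a single gauge value from a Prometheus text-format
// scrape body (e.g. from the node's /_status/vars endpoint). It returns the
// first sample whose metric name matches exactly, and false if none is found.
func scrapeGauge(body, name string) (float64, bool) {
	for _, line := range strings.Split(body, "\n") {
		fields := strings.Fields(line)
		// Data lines have the shape "metric_name value"; comment lines
		// start with "#" and never match an exact name comparison.
		if len(fields) == 2 && fields[0] == name {
			if v, err := strconv.ParseFloat(fields[1], 64); err == nil {
				return v, true
			}
		}
	}
	return 0, false
}

func main() {
	sample := "# TYPE intentcount gauge\nintentcount 3\n"
	v, ok := scrapeGauge(sample, "intentcount")
	fmt.Println(v, ok) // 3 true
}
```

A test harness would fetch the endpoint after the workload, scrape the gauge, and record or assert on it, with no extra data channel needed.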

aliher1911 commented
If we're going to build something dedicated for this, it makes no sense to add it here, and printing it to the logs is pretty much useless. Should we park this until there's clarity on metrics?

erikgrinaker commented
Yeah, let's wait and see.
