roachtest: grafana API for annotations #119821

Closed
srosenberg opened this issue Mar 1, 2024 · 4 comments · Fixed by #120019
Labels
A-testing Testing tools and infrastructure C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-testeng TestEng Team

Comments

@srosenberg
Member

srosenberg commented Mar 1, 2024

Grafana annotations [1] are typically used to superimpose contextual information onto existing dashboard(s). E.g., a roachtest executing some workload, say TPCC, could create one annotation to denote the range of time during which data is loaded, and another annotation to denote the benchmarking phase. This new contextual information facilitates presentation of the (visual) data; specifically, it delineates unique events which may or may not be correlated. Thus, it makes it easier to find event-local patterns.

There are many other use-cases for annotations. However, our intended primary use-case is to facilitate "chaos" events during failure injection. Note that the DRT team is already using a makeshift script to create such annotations [2].

The centralized grafana instance is behind GCE's Identity-Aware Proxy (IAP). Thus, the internal API will need to generate a fresh identity (oauth2) token; the project-wide default limits the expiration time for these tokens to 1 hour. Also note that the internal grafana instance (i.e., one that's optionally provisioned for the duration of a roachtest) doesn't require authentication. Thus, the roachtest grafana API for annotations should support both methods of authentication.

[1] https://grafana.com/docs/grafana/latest/developers/http_api/annotations/#annotations-api
[2] https://gist.github.com/srosenberg/8e2758d3a9f071b72cd55116c2128a9e

Jira issue: CRDB-36333

@srosenberg srosenberg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-testing Testing tools and infrastructure T-testeng TestEng Team labels Mar 1, 2024

blathers-crl bot commented Mar 1, 2024

cc @cockroachdb/test-eng

@DarrylWong
Contributor

DarrylWong commented Mar 1, 2024

One detail that makes things a bit trickier is that grafana annotations can only be applied over the entire org, a dashboard, or a panel. However, we set up our metrics such that a single roachtest is split up across different dashboards based on the type of metric (e.g., SQL/Queues/Storage).

Adding the annotation over the entire org would be easiest, but we have non-roachtest dashboards, like the aforementioned DRT cluster, that we wouldn't want to add annotations to.

Actually, I had a misunderstanding about how adding to the entire org works. I thought it would add the annotation to every dashboard, but instead it adds it to no dashboard and you have to query it in. That's the same amount of work as adding to a single dashboard, but the annotation doesn't live on any one dashboard, which seems better.

I think the best way to do it is to add annotations to the overview dashboard, and then, for the other relevant dashboards, add an annotation query that filters annotations by tags matching the current test_run_id and test_name.

The main drawback to this is that we now need to keep track of test_run_id in the roachtest so we can create the annotation with those tags in the first place. This is currently a test_runner-only detail, and we'd need to pass it somewhere.

test.Test makes the most sense to me, since we already need to know test.Name(). The unfortunate part is that it ties creating grafana annotations to knowing the test spec, which doesn't seem right to me since they aren't really related. Maybe not the biggest deal in the world, though 🤷 Open to other ideas.

@srosenberg
Member Author

Adding the annotation over the entire org would be easiest, but we have non roachtest dashboards like the aforementioned DRT cluster that we wouldn't want to add annotations to.

What if we just rely on tags to filter out irrelevant org-wide annotations? E.g., for roachtest-specific annotations we could filter on a roachtest tag, for DRT annotations on a drt tag, etc.

The main drawback to this is that we now need to keep track of test_run_id in the roachtest

What if we use cluster (name) in conjunction with test_name instead of exposing test_run_id? (Recall that no two tests can time-share the same cluster.) It seems that the tags can be filtered by using template variables [1], and most of our dashboards already use the cluster variable.

[1] https://grafana.com/docs/grafana/latest/dashboards/build-dashboards/annotate-visualizations/#filter-queries-by-tag

@DarrylWong
Contributor

What if we just rely on tags to filter out irrelevant org-wide annotations?

Ahh, thinking about it more, I don't think this is really an issue. We're going to need filters anyway so we don't see the annotations from every test at once.

The way annotation filtering works in grafana seems a bit clunky: from what I can tell, you can't really filter annotations out, only query them in. That is, you can't add a query that excludes all annotations not matching a tag; you have to query for all annotations that do match it. Which is why the approach I mentioned of adding to just one dashboard wouldn't be any extra work compared to adding org-wide, now that I think about it.

What if we use cluster (name) in conjunction with test_name instead of exposing test_run_id?

It looks like we have two sets of dashboards? One filtered by cluster and one filtered by test run and name? I'm not sure which one people use more, but I'd guess the latter, since that's what the roachtest output spits out:

https://github.com/cockroachdb/cockroach/blob/b4b889347fb12215bb09307fab89296916c85c1f/pkg/cmd/roachtest/test_runner.go#L982-L983

I'd need to test it, but I don't think those dashboards have the cluster variable:

[screenshot]
