fix(grafana): repair node selection & metrics name #158

Closed
wants to merge 1 commit into from

Contributor

@aslafy-z aslafy-z commented Mar 27, 2024

  • Update 'dns' & 'clusters' Grafana dashboards to fix node selection
  • Update 'pod-level' Grafana dashboard to fix metrics names, pod selection and datasource templating
  • Update datasource variable to DS_PROMETHEUS convention

Note: The pod-level Grafana dashboard still has some old metrics to update.

@aslafy-z aslafy-z marked this pull request as ready for review March 27, 2024 15:58
@aslafy-z aslafy-z requested a review from a team as a code owner March 27, 2024 15:58
@vakalapa
Contributor

@huntergregory to review.

@aslafy-z aslafy-z changed the title from "fix(grafana): repair node selection" to "fix(grafana): repair node selection & metrics name" Mar 27, 2024
@rbtr rbtr requested a review from huntergregory March 27, 2024 16:23
@rbtr rbtr added type/fix Fixes something area/infra Test, Release, or CI Infrastructure labels Mar 27, 2024
@rbtr rbtr added the priority/1 P1 label Mar 28, 2024
@aslafy-z aslafy-z force-pushed the patch-3 branch 3 times, most recently from 8a7d8aa to 2416abb Compare April 2, 2024 07:53

github-actions bot commented May 3, 2024

This PR will be closed in 7 days due to inactivity.

@github-actions github-actions bot added the meta/waiting-for-author Blocked and waiting on the author label May 3, 2024
@aslafy-z
Contributor Author

aslafy-z commented May 3, 2024

Please have a look @vakalapa @huntergregory @rbtr

@rbtr
Collaborator

rbtr commented May 3, 2024

hey @aslafy-z, thanks for working on this fix. I see that you have the DCO "signed-off-by" on all your commits, but we also need a cryptographic sig to be able to guarantee origin. Would you update these with a signature? Here's how: https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits
I'm looking for this "Verified" tag on the commits in the PR once you've done that:
image

@aslafy-z
Contributor Author

aslafy-z commented May 3, 2024

@rbtr I just rebased, squashed and signed my commits :)

Collaborator

@rbtr rbtr left a comment

lgtm, just need final sign-off from @huntergregory

@github-actions github-actions bot removed the meta/waiting-for-author Blocked and waiting on the author label May 4, 2024
Contributor

@huntergregory huntergregory left a comment

Hi @aslafy-z, sorry for the delay. I missed the notifications from March 😕. Thanks for your PR and the interest in improving these dashboards.

Do you mind updating the PR description with details/examples as needed for the bugs/fixes? Also, if you have a working pod-level dashboard, could you help fix #271?

Added more details in the comment, but I don't think it would make sense to filter by node in the "Fleet View". I'm also not sure that we can/should change datasource to DS_PROMETHEUS.

},
"editorMode": "code",
- "expr": "sum(rate(networkobservability_forward_count{direction=\"egress\", cluster=\"$cluster\", instance=~\"$Nodes\"}[$__rate_interval]))",
+ "expr": "sum(rate(networkobservability_forward_count{direction=\"egress\", cluster=\"$cluster\", instance=~\"($Nodes):[0-9]+\"}[$__rate_interval]))",
Contributor

what does the : do in this regex? Also, could you help me understand scenarios where the node selection is broken?

Contributor Author

The $Nodes variable holds one or more node IPs, formatted like 1.1.1.1 or 1.1.1.1|2.2.2.2.
The instance label, however, holds node:port, e.g. 1.1.1.1:1234.
This edit makes it possible to select multiple nodes.
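
To illustrate the mismatch, here is a minimal Python sketch rather than anything taken from the dashboards. It relies on Prometheus anchoring `=~` matchers at both ends, which `re.fullmatch` emulates, and it assumes `$Nodes` expands to the pipe-joined form shown above. The instance values and the helper name `matches` are made up for the example.

```python
import re


def matches(selector: str, instance: str) -> bool:
    """Emulate a Prometheus =~ matcher, which must match the whole label value."""
    return re.fullmatch(selector, instance) is not None


# Assumed expansion of $Nodes when two nodes are selected.
nodes = "1.1.1.1|2.2.2.2"

old_selector = nodes                       # instance=~"$Nodes"
new_selector = f"({nodes}):[0-9]+"         # instance=~"($Nodes):[0-9]+"
unparenthesized = f"{nodes}:[0-9]+"        # port binds only to the last alternative

# Hypothetical instance label values in node:port form.
instances = ["1.1.1.1:1234", "2.2.2.2:1234", "3.3.3.3:1234"]

old_hits = [i for i in instances if matches(old_selector, i)]
new_hits = [i for i in instances if matches(new_selector, i)]
loose_hits = [i for i in instances if matches(unparenthesized, i)]

print("old:", old_hits)
print("new:", new_hits)
print("without parentheses:", loose_hits)
```

The parentheses matter: without them the `:[0-9]+` suffix attaches only to the last IP in the alternation, so only that node would be selected.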

},
"editorMode": "code",
- "expr": "sum by (cluster) (rate(networkobservability_drop_count[$__rate_interval]))",
+ "expr": "sum by (cluster) (rate(networkobservability_drop_count{instance=~\"($Nodes):[0-9]+\"}[$__rate_interval]))",
Contributor

This and above queries are part of the "Fleet View" panels, where the dashboard summarizes metrics across clusters. I'm not sure it makes sense to filter based on node here, since the Nodes variable only contains nodes for the selected cluster (there is always exactly one cluster selected):

"name": "Nodes",
"options": [],
"query": {
"query": "label_values(kube_node_info{cluster=\"$cluster\"},node)",
"refId": "PrometheusVariableQueryEditor-VariableQuery"
},

There are some analogous panels below where someone can filter by node.

Contributor

Turns out that this dashboard is broken and not importable (#271), at least on my Grafana setup. If you have a working version, would you be able to export it for sharing externally?
image

Contributor Author

I left my previous job and have no access to a cluster where I can install Retina right now. I'll try on a kind cluster when I'm back with my personal laptop in a few days and see how it goes.

"refId": "StandardVariableQuery"
},
"refresh": 1,
- "regex": "/.*_podname=\"([^\"]*).*/",
+ "regex": "/.*podname=\"([^\"]*).*/",
Contributor

nice catch. You must have been using the advanced local-context metric mode. Just noting how this prompted initial thoughts on #344
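
For anyone following along, a small Python sketch of why dropping the underscore matters. The two series strings are hypothetical: one carries a prefixed label such as `source_podname`, the other a bare `podname` label, which is assumed here to be what the local-context mode emits. Grafana's `/.../` delimiters are omitted because Python's `re` does not use them.

```python
import re

# Old and new variable regexes from the diff, without the /.../ delimiters.
OLD = re.compile(r'.*_podname="([^"]*).*')
NEW = re.compile(r'.*podname="([^"]*).*')


def extract_pod(pattern: re.Pattern, series: str):
    """Return the captured pod name, or None when the regex does not match."""
    m = pattern.match(series)
    return m.group(1) if m else None


# Hypothetical series strings, invented for illustration only.
prefixed = 'example_metric{source_podname="web-0", direction="egress"}'
bare = 'example_metric{podname="web-0", direction="egress"}'

for name, series in (("prefixed", prefixed), ("bare", bare)):
    print(name, "old:", extract_pod(OLD, series), "new:", extract_pod(NEW, series))
```

The old pattern requires an underscore immediately before `podname`, so it finds nothing when the label is unprefixed; the new pattern accepts both forms.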

@@ -107,7 +107,7 @@
{
"datasource": {
"type": "prometheus",
- "uid": "${datasource}"
+ "uid": "${DS_PROMETHEUS}"
Contributor

> Update datasource variable to DS_PROMETHEUS convention

Do you have a link to this convention? I just glanced at the top dashboard on Grafana.com, and it uses datasource rather than DS_PROMETHEUS as the variable. Seems the same for the built-in dashboards in Azure's managed Grafana.

I'm also afraid that this might be a breaking change to someone's existing dashboard setup.

Contributor Author

I've observed this "naming convention" widely used in recent years. If the dashboard needs to incorporate another type of datasource in the future, the name will clearly indicate the type.

Contributor Author

I can revert the change if you prefer.

Contributor

I see. Added some thoughts here #158 (review)

github-actions bot commented Jun 4, 2024

This PR will be closed in 7 days due to inactivity.

@github-actions github-actions bot added the meta/waiting-for-author Blocked and waiting on the author label Jun 4, 2024
@nddq nddq removed the meta/waiting-for-author Blocked and waiting on the author label Jun 6, 2024

github-actions bot commented Jul 7, 2024

This PR will be closed in 7 days due to inactivity.

@github-actions github-actions bot added meta/waiting-for-author Blocked and waiting on the author and removed meta/waiting-for-author Blocked and waiting on the author labels Jul 7, 2024

"hide": 0,
"includeAll": true,
"label": "Nodes",
"multi": true,
"name": "Nodes",
"options": [],
"query": {
- "query": "label_values(kube_node_info{cluster=\"$cluster\"},node)",
+ "query": "label_values(kube_node_info{cluster=\"$cluster\"},internal_ip)",

I think we could use named capture groups here to separate displayed value and used value, e.g.

        "query": {
          "qryType": 3,
          "query": "query_result(kube_node_info)",
          "refId": "PrometheusVariableQueryEditor-VariableQuery"
        },
        "refresh": 2,
        "regex": "/node=\"(?<text>[^\"]+)|internal_ip=\"(?<value>[^\"]+)/g",

This would allow users to select nodes based on names but filter panels by underlying IPs
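
A rough Python sketch of how such a regex could split the displayed name from the value used in queries. Python spells named groups `(?P<name>...)` where the JavaScript-style regex above uses `(?<name>...)`. The sample series string, node name, and IP are invented, and merging the `text` and `value` captures from separate matches into one option is an assumption about how the variable editor combines them.

```python
import re

# Same alternation as the suggested regex, in Python's named-group syntax.
PATTERN = re.compile(r'node="(?P<text>[^"]+)|internal_ip="(?P<value>[^"]+)')


def to_option(series: str) -> dict:
    """Collect the 'text' (display name) and 'value' (IP) captures from one series."""
    option = {}
    for m in PATTERN.finditer(series):          # emulates the /g flag
        for key, captured in m.groupdict().items():
            if captured is not None:            # each match fills only one group
                option[key] = captured
    return option


# Hypothetical result line from query_result(kube_node_info).
series = 'kube_node_info{internal_ip="10.0.0.4", node="aks-nodepool1-0"} 1'
option = to_option(series)
print(option)
```

With an option shaped like this, the dropdown would show the node name while panel queries receive the IP.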

Contributor

@huntergregory huntergregory left a comment

The dashboard files have moved since #432. Could you please apply the changes in the new files? Also, I think @cmergenthaler makes a good suggestion about displaying the node names.

If changing datasource.uid would break someone's dashboard when updating the dashboard to the new version, then I would prefer we not change it. Either way, it would be nice to keep this PR's scope smaller and make a change to datasource.uid in another PR (there are also test files that depend on that value).

github-actions bot commented Sep 7, 2024

This PR will be closed in 7 days due to inactivity.

@github-actions github-actions bot added the meta/waiting-for-author Blocked and waiting on the author label Sep 7, 2024

Pull request closed due to inactivity.

@github-actions github-actions bot closed this Sep 14, 2024
@aslafy-z
Contributor Author

aslafy-z commented Sep 26, 2024

@rbtr @huntergregory @cmergenthaler
My priorities have shifted: I'm now working on another project and have no Azure clusters on hand. Feel free to take over if you're able.

@aslafy-z aslafy-z deleted the patch-3 branch September 26, 2024 16:08
@rbtr
Collaborator

rbtr commented Sep 26, 2024

@ibezrukavyi

@ibezrukavyi ibezrukavyi self-assigned this Sep 26, 2024
@aslafy-z aslafy-z restored the patch-3 branch September 26, 2024 20:06
@SRodi SRodi reopened this Sep 30, 2024
SRodi added a commit to SRodi/retina that referenced this pull request Sep 30, 2024
…e groups in clusters dash

Signed-off-by: Simone Rodigari <[email protected]>
SRodi added a commit to SRodi/retina that referenced this pull request Sep 30, 2024
…e groups in clusters dash

Signed-off-by: Simone Rodigari <[email protected]>
SRodi added a commit to SRodi/retina that referenced this pull request Sep 30, 2024
…re groups in clusters dash

Signed-off-by: Simone Rodigari <[email protected]>
@aslafy-z aslafy-z closed this Sep 30, 2024
@aslafy-z aslafy-z deleted the patch-3 branch September 30, 2024 17:48
github-merge-queue bot pushed a commit that referenced this pull request Oct 4, 2024
…dash (#797)

# Description

This PR is to fix #158

* reduce scope of PR
* [make it possible to select multiple nodes on clusters
dash](#158 (comment))
* [fix pod-level
regex](#158 (comment))
* [~~use named capture groups here to separate displayed value and used
value in clusters
dash~~](#158 (comment))

>NOTE: I have reverted the change to DS_PROMETHEUS not to break existing
deployments and tests. This was requested in [this
comment](#158 (review))


## Related Issue

fix #271


If this pull request is related to any issue, please mention it here.
Additionally, make sure that the issue is assigned to you before
submitting this pull request.

## Checklist

- [x] I have read the [contributing
documentation](https://retina.sh/docs/contributing).
- [x] I signed and signed-off the commits (`git commit -S -s ...`). See
[this
documentation](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification)
on signing commits.
- [x] I have correctly attributed the author(s) of the code.
- [x] I have tested the changes locally.
- [x] I have followed the project's style guidelines.
- [x] I have updated the documentation, if necessary.
- [x] I have added tests, if applicable.

## Screenshots (if applicable) or Testing Completed

Please add any relevant screenshots or GIFs to showcase the changes
made.

### All dashboards

![Screenshot 2024-10-01
152822](https://github.com/user-attachments/assets/6b15f10d-dc12-4405-9898-7da59b2fcdd9)

![Screenshot 2024-10-01
152846](https://github.com/user-attachments/assets/5e1763ce-2a48-4dd9-b4c5-f2b52a7cb3d5)

![Screenshot 2024-10-01
152917](https://github.com/user-attachments/assets/3e4aab9d-7b44-4357-a709-d137e3bb8e47)

### Node selection fix

![Screenshot 2024-10-02
103738](https://github.com/user-attachments/assets/5b61ce34-6a1e-414b-8c9e-1b35f89f7efb)

![Screenshot 2024-10-02
103802](https://github.com/user-attachments/assets/529d6b9f-85a9-48e6-be52-252ecadd066b)

## Additional Notes

Thanks to @aslafy-z for the original PR
#158


---

Please refer to the [CONTRIBUTING.md](../CONTRIBUTING.md) file for more
information on how to contribute to this project.

Signed-off-by: Simone Rodigari <[email protected]>
Labels
area/data-ingestion-and-visualization area/infra Test, Release, or CI Infrastructure meta/waiting-for-author Blocked and waiting on the author priority/1 P1 type/fix Fixes something