
New Dashboard: Simple Cluster Status Overview #2343

Draft: wants to merge 4 commits into main from ka/simple-uptime-dashboard

Conversation

@Zash (Contributor) commented Nov 14, 2024

  • apps: copy blackbox exporter dashboard as base for new simple status
  • apps: rework cluster-status overview

Warning

This is a public repository; ensure you do not disclose:

  • personal data beyond what is necessary for interacting with this pull request, nor
  • business confidential information, such as customer names.

What kind of PR is this?

Required: Mark one of the following that is applicable:

  • kind/feature
  • kind/improvement
  • kind/deprecation
  • kind/documentation
  • kind/clean-up
  • kind/bug
  • kind/other

Optional: Mark one or more of the following that are applicable:

Important

Breaking changes should be marked kind/admin-change or kind/dev-change depending on type
Critical security fixes should be marked with kind/security

  • kind/admin-change
  • kind/dev-change
  • kind/security
  • kind/adr

What does this PR do / why do we need this PR?

Adds a new dashboard intended to provide a better user-facing overview of the overall state of the cluster.
It should answer the questions "Is the cluster working?" and "Have there been recent problems?"
It is based on the Prometheus Blackbox Exporter Dashboard, with some panels at the top added and tweaked.
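To illustrate the kind of data such a status overview builds on (a sketch of typical queries, not necessarily the exact panel queries in this PR), the Blackbox Exporter exposes `probe_success` and `probe_duration_seconds` per probed target:

```promql
# Current state per probed endpoint: 1 = up, 0 = down
probe_success

# Fraction of successful probes over the last 24h per target,
# e.g. for a "last 24h status" stat panel (hypothetical panel query)
avg_over_time(probe_success[24h])

# Probe latency, as shown in the per-system latency graphs
probe_duration_seconds
```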

Information to reviewers

  • Does it tell you if the cluster is working?

  • Does it tell you if the cluster has had recent issues?

  • Are there any other indicators it should show?

  • Should this be merged back into the Prometheus Blackbox Exporter?

(screenshot: the new cluster status overview dashboard)

Checklist

  • Proper commit message prefix on all commits
  • Change checks:
    • The change is transparent
    • The change is disruptive
    • The change requires no migration steps
    • The change requires migration steps
    • The change upgrades CRDs
    • The change updates the config and the schema
  • Documentation checks:
  • Metrics checks:
    • The metrics are still exposed and present in Grafana after the change
    • The metrics names didn't change (Grafana dashboards and Prometheus alerts are not affected)
    • The metrics names did change (Grafana dashboards and Prometheus alerts were fixed)
  • Logs checks:
    • The logs do not show any errors after the change
  • Pod Security Policy checks:
    • Any changed pod is covered by Pod Security Admission
    • Any changed pod is covered by Gatekeeper Pod Security Policies
    • The change does not cause any pods to be blocked by Pod Security Admission or Policies
  • Network Policy checks:
    • Any changed pod is covered by Network Policies
    • The change does not cause any dropped packets in the NetworkPolicy Dashboard
  • Audit checks:
    • The change does not cause any unnecessary Kubernetes audit events
    • The change requires changes to Kubernetes audit policy
  • Falco checks:
    • The change does not cause any alerts to be generated by Falco
  • Bug checks:
    • The bug fix is covered by regression tests

Zash added commits on November 14, 2024:

  • Tweaks the probe success timeline to be easier to read
  • Some Pod status that may be relevant
  • Last 24h status, based on feedback from Viktor
@Zash force-pushed the ka/simple-uptime-dashboard branch from 57e5c10 to dac57df on November 18, 2024 09:36
@viktor-f (Contributor) left a comment:

Thanks for adding this. I'm in contact with a user as well to get more feedback.

Contributor comment:

Question: I realized that this will show the probes that are present in the datasource one has selected. For user grafana this means that it will default to the probes in wc. That might be good and it is showing most of the important services in sc as well. But I wonder if it will be obvious to users that you could find more/other probes by switching to the service cluster datasource.
Do you have any thoughts about this? Should there be a text box explaining this (and potentially other things about the dashboard)?

@viktor-f (Contributor):

> Thanks for adding this. I'm in contact with a user as well to get more feedback.

I got word that the user is on vacation and will be back in 2 weeks. So expect some feedback from them after that.

@viktor-f (Contributor) commented Dec 9, 2024:

Some user feedback (adding it as a file comment to make it a thread):

I took a look at the dashboard when I was running some database jobs last week. I think I came in with the wrong expectations, because the dashboard is clearly not targeting that use-case.

The latency graphs for the systems in the dashboard seem really good for pinpointing or eliminating those systems as culprits for performance problems. The most immediate panel “in the same spirit” that I felt was missing was S3. I don’t know how you probe, so it’s possible a misbehaving storage solution could be seen through spiking times for Harbor or similar, but it would be better to have the status of the storage solution shown in the same way as for the other systems, so one doesn’t have to infer its health.

A small thing is that the namespace selector at the top doesn’t seem to have any effect. I don’t know how hard it is to remove that and other UI elements that aren’t used, but I think that would make it slightly easier to use the dashboard.
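On the missing S3 panel: if the dashboard's probes come from the Blackbox Exporter, an S3 endpoint could in principle be covered with an HTTP probe module plus a scrape target. A minimal sketch, with illustrative names not taken from this PR:

```yaml
# blackbox exporter configuration (sketch): an HTTP module for probing
# an S3-compatible endpoint
modules:
  http_s3:                      # hypothetical module name
    prober: http
    timeout: 5s
    http:
      # Many S3-compatible endpoints answer an unauthenticated GET on /
      # with 403; treating both 2xx and 403 as "reachable" avoids
      # needing credentials in the probe.
      valid_status_codes: [200, 403]
```

The resulting `probe_success` series for that target would then show up alongside the existing services in the dashboard.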


Successfully merging this pull request may close these issues.

Simple uptime and status dashboard [goto-monitoring]