
New Dashboard: Simple Cluster Status Overview #2343

Draft: wants to merge 4 commits into main from ka/simple-uptime-dashboard

Conversation

@Zash (Contributor) commented Nov 14, 2024

  • apps: copy blackbox exporter dashboard as base for new simple status
  • apps: rework cluster-status overview

Warning

This is a public repository; ensure you do not disclose:

  • personal data beyond what is necessary for interacting with this pull request, nor
  • business confidential information, such as customer names.

What kind of PR is this?

Required: Mark one of the following that is applicable:

  • kind/feature
  • kind/improvement
  • kind/deprecation
  • kind/documentation
  • kind/clean-up
  • kind/bug
  • kind/other

Optional: Mark one or more of the following that are applicable:

Important

Breaking changes should be marked kind/admin-change or kind/dev-change depending on type
Critical security fixes should be marked with kind/security

  • kind/admin-change
  • kind/dev-change
  • kind/security
  • kind/adr

What does this PR do / why do we need this PR?

Adds a new dashboard intended to provide a better user-facing overview of the overall state of the cluster.
It should answer the questions "Is the cluster working?" and "Have there been recent problems?"
It is based on the Prometheus Blackbox Exporter Dashboard, with some panels at the top added and tweaked.
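To illustrate the kind of data such a status overview builds on (a sketch of typical queries, not necessarily the exact panel queries in this PR), the Blackbox Exporter exposes `probe_success` and `probe_duration_seconds` per probed target:

```promql
# Current state per probed endpoint: 1 = up, 0 = down
probe_success

# Fraction of successful probes over the last 24h per target,
# e.g. for a "last 24h status" stat panel (hypothetical panel query)
avg_over_time(probe_success[24h])

# Probe latency, as shown in the per-system latency graphs
probe_duration_seconds
```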

Information to reviewers

  • Does it tell you if the cluster is working?

  • Does it tell you if the cluster has had recent issues?

  • Are there any other indicators it should show?

  • Should this be merged back into the Prometheus Blackbox Exporter?

(screenshot: the new cluster status overview dashboard)

Checklist

  • Proper commit message prefix on all commits
  • Change checks:
    • The change is transparent
    • The change is disruptive
    • The change requires no migration steps
    • The change requires migration steps
    • The change upgrades CRDs
    • The change updates the config and the schema
  • Documentation checks:
  • Metrics checks:
    • The metrics are still exposed and present in Grafana after the change
    • The metrics names didn't change (Grafana dashboards and Prometheus alerts are not affected)
    • The metrics names did change (Grafana dashboards and Prometheus alerts were fixed)
  • Logs checks:
    • The logs do not show any errors after the change
  • Pod Security Policy checks:
    • Any changed pod is covered by Pod Security Admission
    • Any changed pod is covered by Gatekeeper Pod Security Policies
    • The change does not cause any pods to be blocked by Pod Security Admission or Policies
  • Network Policy checks:
    • Any changed pod is covered by Network Policies
    • The change does not cause any dropped packets in the NetworkPolicy Dashboard
  • Audit checks:
    • The change does not cause any unnecessary Kubernetes audit events
    • The change requires changes to Kubernetes audit policy
  • Falco checks:
    • The change does not cause any alerts to be generated by Falco
  • Bug checks:
    • The bug fix is covered by regression tests

Zash added commits on November 14, 2024:

  • Tweaks the probe success timeline to be easier to read
  • Some Pod status that may be relevant
  • Last 24h status, based on feedback from Viktor
@Zash force-pushed the ka/simple-uptime-dashboard branch from 57e5c10 to dac57df on November 18, 2024 09:36
@viktor-f (Contributor) left a comment:

Thanks for adding this. I'm in contact with a user as well to get more feedback.

Contributor comment:

Question: I realized that this will show the probes that are present in the datasource one has selected. For user grafana this means that it will default to the probes in wc. That might be good and it is showing most of the important services in sc as well. But I wonder if it will be obvious to users that you could find more/other probes by switching to the service cluster datasource.
Do you have any thoughts about this? Should there be a text box explaining this (and potentially other things about the dashboard)?

@viktor-f (Contributor):

> Thanks for adding this. I'm in contact with a user as well to get more feedback.

I got word that the user is on vacation and will be back in 2 weeks. So expect some feedback from them after that.

@viktor-f (Contributor) commented Dec 9, 2024:

Some user feedback (adding it as a file comment to make it a thread):

I took a look at the dashboard when I was running some database jobs last week. I think I came in with the wrong expectations, because the dashboard is clearly not targeting that use-case.

The latency graphs for the systems in the dashboard seem really good for pinpointing or eliminating those systems as culprits for performance problems. The most immediate panel “in the same spirit” that I felt was missing was S3. I don’t know how you probe, so it’s possible a misbehaving storage solution could be seen through spiking times for Harbor or similar, but it would be better to have the status of the storage solution shown in the same way as for the other systems, so one doesn’t have to infer its health.

A small thing is that the namespace selector at the top doesn’t seem to have any effect. I don’t know how hard it is to remove that and other UI elements that aren’t used, but I think that would make it slightly easier to use the dashboard.
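On the missing S3 panel: if the dashboard's probes come from the Blackbox Exporter, an S3 endpoint could in principle be covered with an HTTP probe module plus a scrape target. A minimal sketch, with illustrative names not taken from this PR:

```yaml
# blackbox exporter configuration (sketch): an HTTP module for probing
# an S3-compatible endpoint
modules:
  http_s3:                      # hypothetical module name
    prober: http
    timeout: 5s
    http:
      # Many S3-compatible endpoints answer an unauthenticated GET on /
      # with 403; treating both 2xx and 403 as "reachable" avoids
      # needing credentials in the probe.
      valid_status_codes: [200, 403]
```

The resulting `probe_success` series for that target would then show up alongside the existing services in the dashboard.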


Successfully merging this pull request may close these issues.

Simple uptime and status dashboard [goto-monitoring]