Document broken parts in current observability dashboards and metrics collection #822
Comments
Thank you for reporting your feedback to us! The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5323.
Current state of CKF-COS integration
TL;DR: Most Grafana dashboards do not work or have too little information. Alerts seem to work fine: they report healthy when the cluster is healthy. However, I didn't find a way to reproduce all of them and verify that they fire when they should.
argo-controller
State: No issue in current implementation.
Dex
State: No issue in current implementation.
Envoy
State: Consideration about dashboard.
istio-pilot
State: Issue in alerts.
jupyter-controller
State: Issue in Grafana dashboard and considerations in alerts.
katib-controller
State: Dashboard could probably be improved.
kfp-api
State: No issue in current implementation.
knative-charms
At the moment, there is a metricsEndpoint configured, but there are neither alerts nor a Grafana dashboard.
metacontroller-operator
State: No issue in current implementation.
Minio
State: Grafana dashboard doesn't seem to work.
Seldon-controller-manager
State: Grafana dashboard doesn't seem to work.
Training-operator
State: No issue in current implementation.
Regarding the fixing/redesigning of alerts and dashboards, some useful resources are:
Thanks for gathering this information @orfeas-k! Some comments:
Maybe we want to consider checking those alerts in future efforts, as we want to have alert rules that make sense. I imagine having the basic generic ones (like the unit available rule) and workload-specific ones, but we have to make sure they work as expected, especially if we want to test them automatically in our CI or similar.
I think overall this information is good and gives us a good starting point for improving our integration with Prometheus and Grafana.
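To make the idea of automatically testing alert rules a bit more concrete, here is a minimal Python sketch of the behaviour such a test would check: a generic "unit unavailable" style rule should stay pending while the condition has held for less than its `for` window, and fire once it has held long enough. Everything named here (`Sample`, `alert_states`, the `up < 1` condition, the 120-second window) is a hypothetical illustration and is not taken from the charms' actual rule files. In a real CI setup, Prometheus' own `promtool test rules` would likely be the more natural tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    """One synthetic scrape result (hypothetical type, for illustration only)."""
    timestamp: int  # seconds since the start of the test
    up: int         # 1 if the scrape succeeded, 0 if it failed


def alert_states(samples: List[Sample], for_seconds: int) -> List[Tuple[int, str]]:
    """Evaluate an assumed rule `up < 1` with a `for` window at each sample.

    The alert is "pending" while the condition has held for less than
    `for_seconds`, "firing" once it has held at least that long, and
    "inactive" whenever the condition is false.
    """
    states = []
    active_since = None  # timestamp at which the condition first became true
    for sample in samples:
        if sample.up < 1:
            if active_since is None:
                active_since = sample.timestamp
            held = sample.timestamp - active_since
            state = "firing" if held >= for_seconds else "pending"
        else:
            active_since = None  # condition cleared, so the window resets
            state = "inactive"
        states.append((sample.timestamp, state))
    return states


# Synthetic series: the unit is scraped every 60 s and is down from t=60 to t=240.
series = [Sample(t, up) for t, up in
          [(0, 1), (60, 0), (120, 0), (180, 0), (240, 0), (300, 1)]]

for timestamp, state in alert_states(series, for_seconds=120):
    print(timestamp, state)
```

A CI check built along these lines would feed a known-bad series into the rule and assert that the alert fires, which is the part that could not be verified by hand in the investigation above.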
Docs
In general, https://charmed-kubeflow.io/docs/integrate-with-cos is outdated. Some key points about it:
Thank you @DnPlas for your comment. I have already added the things we should consider for the documentation.
Context
Document parts that are broken and need to be fixed in current observability dashboards and metrics collection.
What needs to get done
Above investigation
Definition of Done
Have a list of things that need to be fixed