Document broken parts in current observability dashboards and metrics collection #822

orfeas-k · 2024-02-12T10:07:10Z

Context

Document parts that are broken and need to be fixed in current observability dashboards and metrics collection.

What needs to get done

Above investigation

Definition of Done

Have a list of things that need to be fixed

syncronize-issues-to-jira · 2024-02-12T10:07:18Z

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5323.

This message was autogenerated

orfeas-k · 2024-02-21T11:52:18Z

Current state of CKF-COS integration

TL;DR: Most Grafana dashboards either do not work or show too little information. Alerts seem to work fine: they are healthy when the cluster is healthy. However, I didn't find a way to reproduce all of them and verify that they are firing when they are needed.

argo-controller

State: No issue in current implementation.

Grafana dashboard
There is a grafana dashboard that looks to be working well.
Alerts
There are the following alerts:
- unit_unavailable.rule
- loglines_error.rule
- loglines_warning.rule
- workflows_erroring.rule
- workflows_failing.rule
- workflows_pending.rule
All of them seem to follow the upstream docs about workflows. I could only verify that unit_unavailable.rule works (by running pebble stop argo-controller). I couldn't verify the rest, since I didn't know how to trigger a workflow that would deterministically fail or log errors.
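One possible way to get a workflow that fails deterministically is a single-step workflow whose container exits with a non-zero code. The sketch below only builds such a manifest; the `busybox` image, the `always-fails-` name prefix and the `fail` template name are assumptions for illustration, and I haven't checked whether a workflow in the Failed phase is what the expressions in workflows_failing.rule or workflows_erroring.rule actually look at.

```python
import json


def failing_workflow(name_prefix: str = "always-fails-") -> dict:
    """Build an Argo Workflow manifest whose only step exits with code 1."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        # generateName lets the same manifest be submitted repeatedly.
        "metadata": {"generateName": name_prefix},
        "spec": {
            "entrypoint": "fail",
            "templates": [
                {
                    "name": "fail",
                    "container": {
                        "image": "busybox",  # assumed image, any shell image would do
                        "command": ["sh", "-c"],
                        # Write to stderr, then exit non-zero so the step fails.
                        "args": ["echo 'forced failure' >&2; exit 1"],
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(failing_workflow(), indent=2))
```

The output can be piped to `kubectl create -f -` in the namespace where workflows run (`create` rather than `apply`, because the manifest uses `generateName`).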

Dex

State: No issue in current implementation.

Alerts
There's a unit_available.rule that is working (did pebble stop dex).

Envoy

State: Consideration about dashboard

Grafana dashboard
- Its dashboard is working. It has a handful of metrics configured, and I'm not sure all of them are needed in our Envoy case.
- The charm has an endpoint named grafana-dashboards instead of dashboard (without s) like all other charms.
Alerts
There's a unit_available.rule that is working (deleted its deployment).

istio-pilot

State: issue in alerts.

Alerts
There is an istio.rule. Its expression returns a {} which could mean that the alert is not configured properly. My guess is that it doesn't need the external avg(). On that point, Simme suggested here that we wouldn't want this in production.
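An empty {} on its own is ambiguous: a PromQL comparison filters out series that don't satisfy it, so a healthy alert expression also returns nothing. A rough way to tell the two cases apart is to query the bare metric as well as the full expression. The sketch below works on the JSON bodies that Prometheus returns from `/api/v1/query`; the function names are mine, and fetching the two responses is left out.

```python
def result_count(payload: dict) -> int:
    """Number of series in a Prometheus /api/v1/query response body."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return len(payload.get("data", {}).get("result", []))


def diagnose(full_expr_payload: dict, bare_metric_payload: dict) -> str:
    """Compare the alert expression's result with the bare metric's result."""
    if result_count(bare_metric_payload) == 0:
        # The metric itself is absent, so the rule has nothing to evaluate.
        return "metric missing: the rule can never fire"
    if result_count(full_expr_payload) == 0:
        # The metric exists, so {} may simply mean the condition is false.
        return "metric present: condition currently false or expression too strict"
    return "expression currently matches"
```

If the first case comes back for the metric used in istio.rule, the rule is misconfigured regardless of the avg(); the second case needs a look at the expression itself.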

jupyter-controller

State: Issue in grafana dashboard and considerations in alerts.

Grafana dashboard
Its dashboard doesn't seem to work. It exposes two metrics, and both show "No data".
Alerts
There are the following alerts:
- controller.rule
  - Not sure whether this needs to specify the {}
- host_resources.rules
- model_errors.rule: This query returns 2 jobs.
- unit_available.rule
Considerations:
- Unit unavailable goes to pending when the jupyter-controller pebble service is down. The alert never got the chance to fire, since the service is restarted before the 5 minutes are reached. However, it should be working.
- The metrics used by the alerts exist in the output returned when you curl the metrics endpoint. I'm not sure whether controller.rule is out of date and whether we should consider adding {}. Curling the metrics endpoint returns the following:
```
workqueue_unfinished_work_seconds{name="Culler"} 0
workqueue_unfinished_work_seconds{name="notebook"} 0
```
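To check which label values a rule could select on, the curl output can be parsed instead of read by eye. This is a minimal sketch that handles only simple sample lines like the two above, not the full exposition format; `parse_metrics` and `label_values` are names I made up.

```python
import re

# The two lines returned by the jupyter-controller metrics endpoint above.
SAMPLE = '''\
workqueue_unfinished_work_seconds{name="Culler"} 0
workqueue_unfinished_work_seconds{name="notebook"} 0
'''

# metric_name, optional {labels}, then the sample value.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> list:
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    series = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        series.append((match.group("name"), labels, float(match.group("value"))))
    return series


def label_values(series: list, metric: str, label: str) -> list:
    """Sorted distinct values of one label for one metric."""
    return sorted(
        {labels[label] for name, labels, _ in series if name == metric and label in labels}
    )


if __name__ == "__main__":
    parsed = parse_metrics(SAMPLE)
    print(label_values(parsed, "workqueue_unfinished_work_seconds", "name"))
```

For the sample above this lists the two `name` values, Culler and notebook, which are the values a {} selector in controller.rule would have to match.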

katib-controller

State: The dashboard could probably be improved.

Grafana dashboard
There is a dashboard that shows current experiments and trials. However, it could be improved (depending on the decision we take about Grafana dashboards), given that katib-controller allows it.
Alerts
There's a unit_available.rule that is working (deleted its deployment).

kfp-api

State: No issue in current implementation.

Alerts
There's a unit_available.rule that is working (did pebble stop apiserver).

knative-charms

At the moment, there is a metricsEndpoint configured, but there are neither alerts nor a Grafana dashboard.

metacontroller-operator

State: No issue in current implementation.

Alerts
There's a unit_available.rule that is working (deleted its sts).

Minio

State: grafana dashboard doesn't seem to work.

Grafana dashboard
There's a Grafana dashboard that only shows N/A under all metrics.
Alerts
There's a unit_available.rule that is working (deleted its sts).

Seldon-controller-manager

State: grafana dashboard doesn't seem to work.

Grafana dashboard
The Grafana dashboard doesn't seem to work. I applied two SeldonDeployments and the Models metric showed "No Data".
Alerts
- unit_unavailable.rule
- seldon_errors.rule (that contains 5 alerts)
Three of the seldon_errors.rule alerts work when there are SeldonDeployments with errors. I'm not sure about WebhookError and UnfinishedWorkIncrease (I couldn't figure out a way to reproduce those situations). However, all the queries used rely on metrics returned by the controller's metrics endpoint.

Training-operator

State: No issue in current implementation.

Alerts
There's a unit_available.rule that is working (did pebble stop training-operator).

orfeas-k · 2024-02-21T12:04:15Z

Regarding the fixing/redesigning of alerts and dashboards, some useful resources are:

- Common Prometheus alerts used for different apps: https://samber.github.io/awesome-prometheus-alerts/rules.html
- Common Grafana dashboards used for different apps: https://grafana.com/grafana/dashboards/
- The four golden signals of monitoring (an article, and from the Google SRE handbook). I haven't had time to read this thoroughly yet, but it came up while looking at Argo's docs about exposing metrics.
- Dashboard management maturity model

DnPlas · 2024-02-21T12:51:39Z

Thanks for gathering this information @orfeas-k! Some comments:

1. Should we also document here the things that are broken in our COS integration guide?
2. About this comment:

   However, I didn't find a way to reproduce all of them and verify that they are firing when they are needed.

   Maybe we want to consider checking those alerts in future efforts, as we want to have alert rules that make sense. I imagine having the basic generic ones (like the unit available rule) and workload-specific ones, but we have to make sure they work as expected, especially if we want to test them automatically in our CI or similar.

3. I assume we are not documenting the charms that are yet to be integrated with Prometheus and Grafana in this issue, right? Do we have a different tracking issue?

I think overall this information is good and gives us a good starting point for improving our integration with Prometheus and Grafana.

orfeas-k · 2024-02-21T14:11:20Z

Docs

In general, https://charmed-kubeflow.io/docs/integrate-with-cos is outdated. Some key points about it:

- IIUC, it takes as a requirement that both COS and CKF are deployed in the same cluster.
  - It is not stated clearly whether the user is expected to have deployed both models in the same cluster. There is the statement "If Kubeflow and COS were both deployed to the same cluster, the controller will be the same for both models.", which is not (always) true, since we can deploy two different controllers in the same cluster.
  - Deploying in the same cluster goes against the COS best practice of deploying in isolation.
  - It refers to juju 2.9 (which is incompatible with COS) but doesn't say anything about a specific CKF version.
  - If both COS and CKF are expected to run in the same cluster, the minimum system requirements are not enough (I was using those with a larger disk and juju failed me twice).
- It contains an outdated juju run-action command that doesn't work on juju 3.x versions.
- The guide is not consistent about whether users are running two controllers or two models in one controller. This is visible because there are some redundant commands about switching models, which assume that the two models are deployed using the same controller, while other parts of the guide assume two different controllers.
- It could be noted that the Grafana agent charm is expected to go blocked, since it requires at least one relation. We could document that.
- We could be deploying with offers overlays instead of consuming offers afterwards, as mentioned here.
- The guide does not include steps to integrate the following charms with COS:
  - Envoy
  - Istio-pilot
  - katib-controller grafana dashboard
  - minio grafana dashboard

The view metrics parts of the guide are OK, but considering that we want to move to having metrics for all CKF charms, they should be dropped in favour of something more generic.
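On the run-action point: in juju 3.x, actions are run with `juju run`, which waits for the result by default, while 2.9 uses `juju run-action ... --wait`. If the guide has to cover both for a while, the difference could be captured roughly like this; `action_command` is a made-up helper, and the `grafana/0` unit and `get-admin-password` action in the usage line are only placeholders for whatever the guide calls.

```python
def action_command(juju_major: int, unit: str, action: str) -> list[str]:
    """Build the CLI invocation for running an action on a unit.

    juju 3.x: `juju run <unit> <action>` (waits for the result by default).
    juju 2.9: `juju run-action <unit> <action> --wait`.
    """
    if juju_major >= 3:
        return ["juju", "run", unit, action]
    return ["juju", "run-action", unit, action, "--wait"]


if __name__ == "__main__":
    # Placeholder unit and action names, for illustration only.
    for major in (2, 3):
        print(" ".join(action_command(major, "grafana/0", "get-admin-password")))
```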

orfeas-k · 2024-02-21T15:32:30Z

Thank you @DnPlas for your comment. I have already added a comment with things we should consider for the documentation.

Maybe we want to consider checking those alerts in future efforts, as we want to have alert rules that make sense. I imagine having the basic generic ones (like the unit available rule) and workload-specific ones, but we have to make sure they work as expected, especially if we want to test them automatically in our CI or similar.

I agree that we should be checking those alerts in future efforts; maybe we can discuss more as part of Explore/document shortcomings, pain points and similarities of observability integration testing #823. That being said, let's note that apart from defining what exactly we want to test in that issue, we will also need to put effort into figuring out how to test each workload, i.e. how to reproduce a faulty situation, since there is no standard way in which each workload (be it an app, a k8s controller or whatever) defines and triggers those metrics.
Very good point actually, and it is related to point number 2 too. I went ahead and created a tracker issue for Grafana, Charms to be integrated with grafana #834, and one for Prometheus, Charms to be integrated with prometheus #837.
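For the CI side, once a faulty situation can be reproduced, the check itself could be a poll on Prometheus' `/api/v1/alerts` until the alert moves from pending to firing. Below is a rough sketch under a few assumptions: `fetch` is any callable returning the parsed JSON body of that endpoint (how it reaches Prometheus is left out), and the function names are mine. The timeout has to be longer than the rule's `for:` duration, and the fault has to last that long too, which is exactly what didn't happen with jupyter-controller above.

```python
import time
from typing import Callable


def alert_state(payload: dict, alertname: str) -> str:
    """Return 'firing', 'pending' or 'inactive' for one alert name."""
    states = {
        alert.get("state")
        for alert in payload.get("data", {}).get("alerts", [])
        if alert.get("labels", {}).get("alertname") == alertname
    }
    if "firing" in states:
        return "firing"
    if "pending" in states:
        return "pending"
    return "inactive"


def wait_until_firing(
    fetch: Callable[[], dict],
    alertname: str,
    timeout_s: float,
    poll_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the alert fires; return False if the timeout passes first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if alert_state(fetch(), alertname) == "firing":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)
```

A test would break the workload, call `wait_until_firing(...)` with a timeout a bit above the rule's `for:` value, and assert on the result.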

orfeas-k added the enhancement New feature or request label Feb 12, 2024

orfeas-k mentioned this issue Feb 21, 2024

Charms to be integrated with prometheus #837

Closed

36 tasks

orfeas-k mentioned this issue Feb 26, 2024

Write a spec about CKF charms integration with COS #842

Closed

orfeas-k closed this as completed Feb 27, 2024
