Document broken parts in current observability dashboards and metrics collection #822

Closed
orfeas-k opened this issue Feb 12, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@orfeas-k
Contributor

Context

Document parts that are broken and need to be fixed in current observability dashboards and metrics collection.

What needs to get done

Above investigation

Definition of Done

Have a list of things that need to be fixed

@orfeas-k orfeas-k added the enhancement New feature or request label Feb 12, 2024

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5323.

This message was autogenerated

@orfeas-k
Contributor Author

orfeas-k commented Feb 21, 2024

Current state of CKF-COS integration

TL;DR: Most Grafana dashboards either do not work or show too little information. Alerts seem to work fine: they stay healthy while the cluster is healthy. However, I couldn't find a way to trigger all of them and verify that they fire when they should.

argo-controller

State: No issue in current implementation.

  • Grafana dashboard
    There is a grafana dashboard that looks to be working well.

  • Alerts
    There are the following alerts:

    • unit_unavailable.rule
    • loglines_error.rule
    • loglines_warning.rule
    • workflows_erroring.rule
    • workflows_failing.rule
    • Workflows_pending.rule

    All of them seem to follow the upstream docs about workflows. I couldn't verify that they're working, apart from unit_unavailable.rule (tested by running pebble stop argo-controller), since I didn't know how to trigger a workflow that would deterministically fail or log errors.

Dex

State: No issue in current implementation.

  • Alerts
    There's a unit_available.rule that is working (did pebble stop dex).

Envoy

State: Consideration about dashboard

  • Grafana dashboard

    • Its dashboard is working. There are a handful of metrics configured there, and I'm not sure all of them are needed in our Envoy case.
    • The charm has an endpoint named grafana-dashboards instead of grafana-dashboard (without the s) like all other charms.
  • Alerts
    There's a unit_available.rule that is working (deleted its deployment).

istio-pilot

State: issue in alerts.

  • Alerts
    There is an istio.rule. Its expression returns a {} which could mean that the alert is not configured properly. My guess is that it doesn't need the external avg(). On that point, Simme suggested here that we wouldn't want this in production.

jupyter-controller

State: Issue in grafana dashboard and considerations in alerts.

  • Grafana dashboard
    Its dashboard doesn't seem to work: it exposes two metrics, both showing "No data".

  • Alerts
    There are the following alerts:

    • controller.rule
      • Not sure whether this needs to specify the {}
    • host_resources.rules
    • model_errors.rule: This query returns 2 jobs.
    • unit_available.rule

    Considerations:

    • The unit unavailable alert goes to pending when the jupyter-controller pebble service is down. It never got the chance to fire because the service is restarted before the 5 minutes are reached. However, it should be working.
    • The metrics used by the alerts exist in the output returned when you curl the metrics endpoint. I'm not sure whether controller.rule is out of date and whether we should consider adding {}. Curling the metrics endpoint returns the following:
      workqueue_unfinished_work_seconds{name="Culler"} 0
      workqueue_unfinished_work_seconds{name="notebook"} 0
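
To check which label values a metric actually carries before deciding whether a rule needs a {} selector, the output above can be inspected programmatically. The following is a minimal sketch, assuming the endpoint returns plain Prometheus text exposition format; the function name series_by_metric and the regular expressions are illustrative, not part of any charm, and the parser is deliberately simplified (it does not handle a "}" inside a label value).

```python
import re
from collections import defaultdict

# Sample copied from the jupyter-controller metrics output quoted above.
SAMPLE = '''\
workqueue_unfinished_work_seconds{name="Culler"} 0
workqueue_unfinished_work_seconds{name="notebook"} 0
'''

# Simplified pattern for: metric_name{label="value",...} value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def series_by_metric(text):
    """Group samples by metric name as a list of (labels, value) pairs."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        grouped[match.group("name")].append((labels, float(match.group("value"))))
    return dict(grouped)


if __name__ == "__main__":
    for metric, samples in series_by_metric(SAMPLE).items():
        for labels, value in samples:
            print(metric, labels, value)
```

For the sample above this lists two series of workqueue_unfinished_work_seconds, one with name="Culler" and one with name="notebook", which is the information needed to decide whether the rule should select on the name label.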
      

katib-controller

State: dashboard could be improved probably.

  • Grafana dashboard
    There is a dashboard that shows current experiments and trials. However, it could be improved (depending on the decision we take about Grafana dashboards), given that katib-controller allows it.

  • Alerts
    There's a unit_available.rule that is working (deleted its deployment).

kfp-api

State: No issue in current implementation.

  • Alerts
    There's a unit_available.rule that is working (did pebble stop apiserver).

knative-charms

At the moment, there is a metricsEndpoint configured, but there are neither alerts nor a Grafana dashboard.

metacontroller-operator

State: No issue in current implementation.

  • Alerts
    There's a unit_available.rule that is working (deleted its sts).

Minio

State: grafana dashboard doesn't seem to work.

  • Grafana dashboard
    There's a Grafana dashboard that only shows N/A under all metrics.

  • Alerts
    There's a unit_available.rule that is working (deleted its sts).

Seldon-controller-manager

State: grafana dashboard doesn't seem to work.

  • Grafana dashboard
    The Grafana dashboard doesn't seem to work. I applied two SeldonDeployments and the Models metric showed "No Data".

  • Alerts

    • Unit_unavailable.rule
    • seldon_errors.rule (that contains 5 alerts)

    Three of the alerts in seldon_errors.rule work when there are SeldonDeployments with errors. I'm not sure about WebhookError and UnfinishedWorkIncrease (I couldn't figure out a way to reproduce the situation). However, all the queries use metrics that are returned by the controller's metrics endpoint.

Training-operator

State: No issue in current implementation.

  • Alerts
    There's a unit_available.rule that is working (did pebble stop training-operator).

@orfeas-k
Contributor Author

orfeas-k commented Feb 21, 2024

Regarding the fixing/redesigning of alerts and dashboards, some useful resources are:

@DnPlas
Contributor

DnPlas commented Feb 21, 2024

Thanks for gathering this information @orfeas-k! Some comments:

  1. Should we also document here the things that are broken in our COS integration guide?

  2. About this comment:

However, I didn't find a way to reproduce all of them and verify that they are firing when they are needed.

Maybe we want to consider checking those alerts in future efforts, as we want to have alert rules that make sense. I imagine having the basic generic ones (like the unit available rule) and workload-specific ones, but we have to make sure they work as expected, especially if we want to test them automatically in our CI or similar.

  3. I assume we are not documenting the charms that are yet to be integrated with prometheus and grafana in this issue, right? Do we have a different tracking issue?

I think overall this information is good and gives us a good starting point for improving our integration with Prometheus and Grafana.

@orfeas-k
Contributor Author

Docs

In general, https://charmed-kubeflow.io/docs/integrate-with-cos is outdated. Some key points about it:

  • IIUC, it assumes that both COS and CKF are deployed in the same cluster.
    • It is not clearly stated whether the user is expected to have deployed both models in the same cluster. There is the statement "If Kubeflow and COS were both deployed to the same cluster, the controller will be the same for both models." which is not (always) true, since we can deploy two different controllers in the same cluster.
    • Deploying in the same cluster goes against the COS "deploy in isolation" best practice.
    • It refers to juju 2.9 (which is incompatible with COS) but doesn't say anything about a specific CKF version.
    • If both COS and CKF are expected to run in the same cluster, the minimum system requirements are not enough (I was using those with a larger disk and juju failed me twice).
  • It contains an outdated juju run-action command that doesn't work on juju 3.x versions (where it was replaced by juju run).
  • The guide is inconsistent about whether users have two controllers or two models in one controller. This shows in some redundant commands for switching models, which presuppose that both models are deployed using the same controller, while other parts of the guide assume two different controllers.
  • It could be noted that Grafana agent charm is expected to go blocked since it requires at least one relation. We could document that.
  • We could be deploying with offers overlays instead of consuming offers afterwards, as mentioned here
  • Guide does not include steps to integrate the following charms with COS:
    • Envoy
    • Istio-pilot
    • katib-controller grafana dashboard
    • minio grafana dashboard
  • The "view metrics" parts of the guide are OK, but since we want to move towards having metrics for all CKF charms, they should be replaced with something more generic.

@orfeas-k
Contributor Author

orfeas-k commented Feb 21, 2024

Thank you @DnPlas for your comment. I already added a comment listing things we should consider for the documentation.

  1. Maybe we want to consider checking those alerts in future efforts, as we want to have alert rules that make sense. I imagine having the basic generic ones (like the unit available rule) and workload-specific ones, but we have to make sure they work as expected, especially if we want to test them automatically in our CI or similar.

    I agree that we should be checking those alerts in future efforts; maybe we can discuss this more as part of Explore/document shortcomings, pain points and similarities of observability integration testing #823. That being said, let's note that apart from defining what exactly we want to test in that issue, we will also need to put effort into figuring out how to test each workload, i.e. how to reproduce a faulty situation, since there is no standard way in which each workload (be it an app, a k8s controller or anything else) defines and triggers those metrics.

  2. Very good point actually, and it is related to point number 2 too. I went ahead and created a tracker issue for Grafana (Charms to be integrated with grafana #834) and one for Prometheus (Charms to be integrated with prometheus #837).
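
As an illustration of why reproducing a faulty situation is not enough on its own, the pending-but-never-firing behaviour seen with jupyter-controller can be modelled in a few lines. This is a simplified sketch of how a rule with a "for" duration moves between states, not Prometheus code; the function name alert_states and the 60-second evaluation interval are assumptions made for the example, while the 5-minute duration comes from the observation above.

```python
def alert_states(condition_true, interval_s, for_s):
    """Return the alert state at each rule evaluation.

    condition_true: list of booleans, one per evaluation of the rule expression
    interval_s: seconds between evaluations (assumed value, see lead-in)
    for_s: how long the condition must hold continuously before firing
    """
    states = []
    active_since = None
    for index, is_true in enumerate(condition_true):
        now = index * interval_s
        if not is_true:
            # The condition cleared, so the pending timer is reset.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


if __name__ == "__main__":
    # Service down for three evaluations, then restarted: never fires.
    print(alert_states([True, True, True, False], interval_s=60, for_s=300))
    # Service down for the full five minutes: fires on the sixth evaluation.
    print(alert_states([True] * 6, interval_s=60, for_s=300))
```

Under these assumptions, a test that stops a service has to keep it down for at least the full "for" duration, otherwise the alert only ever reaches pending.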


2 participants