From 7f92953f1221431faa1f77a782b84472f0588bd8 Mon Sep 17 00:00:00 2001
From: David Espejo <82604841+davidmirror-ops@users.noreply.github.com>
Date: Thu, 24 Oct 2024 16:46:11 -0500
Subject: [PATCH] Update monitoring docs (#5903)

* Update refs to public dashboards and instructions

Signed-off-by: davidmirror-ops

* Fix group tab error v1

Signed-off-by: davidmirror-ops

* Apply review suggestions

Signed-off-by: davidmirror-ops

---------

Signed-off-by: davidmirror-ops
---
 docs/deployment/configuration/monitoring.rst | 102 ++++++++++++-------
 1 file changed, 65 insertions(+), 37 deletions(-)

diff --git a/docs/deployment/configuration/monitoring.rst b/docs/deployment/configuration/monitoring.rst
index 48239288f4..449c147754 100644
--- a/docs/deployment/configuration/monitoring.rst
+++ b/docs/deployment/configuration/monitoring.rst
@@ -5,7 +5,7 @@ Monitoring
 
 .. tags:: Infrastructure, Advanced
 
-.. tip:: The Flyte core team publishes and maintains Grafana dashboards built using Prometheus data sources, which can be found `here `__.
+.. tip:: The Flyte core team publishes and maintains Grafana dashboards built using Prometheus data sources. You can import them to your Grafana instance from the `Grafana marketplace `__.
 
 Metrics for Executions
 ======================
@@ -87,53 +87,81 @@ Flyte Backend is written in Golang and exposes stats using Prometheus. The stats
 
 Both ``flyteadmin`` and ``flytepropeller`` are instrumented to expose metrics. To visualize these metrics, Flyte provides three Grafana dashboards, each with a different focus:
 
-- **User-facing dashboards**: Dashboards that can be used to triage/investigate/observe performance and characteristics of workflows and tasks.
-  The user-facing dashboard is published under ID `13980 `__ in the Grafana marketplace.
+- **User-facing dashboard**: it can be used to investigate performance and characteristics of workflow and task executions. It's published under ID `22146 `__ in the Grafana marketplace.
 
 - **System Dashboards**: Dashboards that are useful for the system maintainer to investigate the status and performance of their Flyte deployments. These are further divided into:
 
-     - `DataPlane/FlytePropeller `__: execution engine status and performance.
-     - `ControlPlane/Flyteadmin `__: API-level monitoring.
+     - Data plane (``flytepropeller``): `21719 `__: execution engine status and performance.
+     - Control plane (``flyteadmin``): `21720 `__: API-level monitoring.
 
-The corresponding JSON files for each dashboard are also located at ``deployment/stats/prometheus``.
+The corresponding JSON files for each dashboard are also located in the ``flyte`` repository at `deployment/stats/prometheus `__.
 
 .. note::
 
    The dashboards are basic dashboards and do not include all the metrics exposed by Flyte.
    Feel free to use the scripts provided `here `__ to improve and -hopefully- contribute the improved dashboards.
 
-How to use the dashboards
-~~~~~~~~~~~~~~~~~~~~~~~~~
-
-1. We recommend installing and configuring the Prometheus operator as described in `their docs `__.
-This is especially true if you plan to use the Service Monitors provided by the `flyte-core `__ Helm chart.
-
-2. Enable the Prometheus instance to use Service Monitors in the namespace where Flyte is running, configuring the following keys in the ``prometheus`` resource:
-
-.. code-block:: yaml
-
-   spec:
-     serviceMonitorSelector: {}
-     serviceMonitorNamespaceSelector: {}
-
-.. note::
-
-   The above example configuration lets Prometheus use any ``ServiceMonitor`` in any namespace in the cluster. Adjust the configuration to reduce the scope if needed.
-
-3. Once you have installed and configured the Prometheus operator, enable the Service Monitors in the Helm chart by configuring the following keys in your ``values`` file:
-
-.. code-block:: yaml
-
-   flyteadmin:
-     serviceMonitor:
-       enabled: true
-
-   flytepropeller:
-     serviceMonitor:
-       enabled: true
-
+Setup instructions
+~~~~~~~~~~~~~~~~~~
+
+The dashboards rely on a working Prometheus deployment with access to your Kubernetes cluster and Flyte pods.
+Additionally, the user dashboard uses metrics that come from ``kube-state-metrics``. Both of these requirements can be fulfilled by installing the `kube-prometheus-stack `__.
+
+Once the prerequisites are in place, follow the instructions in this section to configure metrics scraping for the corresponding Helm chart:
+
+.. tabs::
+
+   .. group-tab:: flyte-core
+
+      Save the following in a ``flyte-monitoring-overrides.yaml`` file and run a ``helm upgrade`` operation pointing to that ``--values`` file:
+
+      .. code-block:: yaml
+
+         flyteadmin:
+           serviceMonitor:
+             enabled: true
+             labels:
+               release: kube-prometheus-stack #This is particular to the kube-prometheus-stack
+             selectorLabels:
+               - app.kubernetes.io/name: flyteadmin
+         flytepropeller:
+           serviceMonitor:
+             enabled: true
+             labels:
+               release: kube-prometheus-stack
+             selectorLabels:
+               - app.kubernetes.io/name: flytepropeller
+           service:
+             enabled: true
+
+      The above configuration enables the ``serviceMonitor`` that Prometheus can then use to automatically discover services and scrape metrics from them.
+
+   .. group-tab:: flyte-binary
+
+      Save the following in a ``flyte-monitoring-overrides.yaml`` file and run a ``helm upgrade`` operation pointing to that ``--values`` file:
+
+      .. code-block:: yaml
+
+         configuration:
+           inline:
+             propeller:
+               prof-port: 10254
+               metrics-prefix: "flyte:"
+             scheduler:
+               profilerPort: 10254
+               metricsScope: "flyte:"
+             flyteadmin:
+               profilerPort: 10254
+         service:
+           extraPorts:
+           - name: http-metrics
+             protocol: TCP
+             port: 10254
+
+      The above configuration enables the ``serviceMonitor`` that Prometheus can then use to automatically discover services and scrape metrics from them.
+
 .. note::
 
    By default, the ``ServiceMonitor`` is configured with a ``scrapeTimeout`` of 30s and ``interval`` of 60s. You can customize these values if needed.
 
-With the above configuration in place you should be able to import the dashboards in your Grafana instance.
+With the above configuration completed, you should be able to import the dashboards in your Grafana instance.