Export task stream metrics in prometheus #7107

fjetter · 2022-10-05T09:43:10Z

#7088 triggered a discussion around exposing Prometheus metrics that carry enough information to build a task-stream-like Grafana dashboard.
An earlier attempt, which exposes different information, is shown in #7083.

Both approaches include valuable information but fall short of actually reproducing what we're looking for. This issue is to track this effort and collect ideas.

fjetter · 2022-10-05T10:00:47Z

Progress bar

To expose the progress bar as we have it on the Bokeh dashboard, we'd need the current number of tasks in each state, broken down by prefix, e.g.

diff --git a/distributed/http/scheduler/prometheus/core.py b/distributed/http/scheduler/prometheus/core.py
index d8a8f00c5..46e962117 100644
--- a/distributed/http/scheduler/prometheus/core.py
+++ b/distributed/http/scheduler/prometheus/core.py
@@ -54,9 +54,17 @@ class SchedulerMetricCollector(PrometheusCollector):
             labels=["task_prefix_name"],
         )

+        prefix_states = GaugeMetricFamily(
+            self.build_name("prefix_states"),
+            "Current number of tasks in a given state by prefix",
+            labels=["task_prefix_name", "state"],
+        )
+
         for tp in self.server.task_prefixes.values():
             suspicious_tasks.add_metric([tp.name], tp.suspicious)
+            for st, val in tp.states.items():
+                prefix_states.add_metric([tp.name, st], val)
         yield suspicious_tasks
+        yield prefix_states

See #7088 (comment)
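
As a rough illustration of what a scrape of such a gauge could return, here is a minimal standard-library sketch that counts tasks per (prefix, state) and renders them in the Prometheus text exposition format. It does not use the scheduler or `prometheus_client`; the function names, the sample task list, and the metric name `dask_scheduler_prefix_states` are assumptions made for this example only.

```python
from collections import Counter


def escape_label(value: str) -> str:
    """Escape a label value per the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_prefix_states(name: str, counts: dict) -> str:
    """Render {(prefix, state): count} as one gauge family in text format."""
    lines = [
        f"# HELP {name} Current number of tasks in a given state by prefix",
        f"# TYPE {name} gauge",
    ]
    for (prefix, state), value in sorted(counts.items()):
        labels = (
            f'task_prefix_name="{escape_label(prefix)}",'
            f'state="{escape_label(state)}"'
        )
        lines.append(f"{name}{{{labels}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical snapshot: one (prefix, state) pair per task known to the scheduler.
tasks = [
    ("inc", "memory"),
    ("inc", "memory"),
    ("inc", "processing"),
    ("add", "waiting"),
]
counts = Counter(tasks)
exposition = render_prefix_states("dask_scheduler_prefix_states", counts)
print(exposition)
```

Each (prefix, state) pair becomes one time series, so the number of series grows with prefixes times states rather than with the number of tasks.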

Task stream

For the task stream (assuming we want to expose the exact same info), I assume we could hook into the TaskStreamPlugin, which collects all this information and buffers it. The Prometheus metrics collection would then expose this information (it's possible to set a custom timestamp on exposed metrics; there is a kwarg for it).

I'm slightly worried about the data volume, but we'd need to test it. The implementation should be reasonably easy.

If data volume is an issue, we could perform some aggregation, e.g. exposing compute, disk, and network time by prefix rather than by task.
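
The aggregation idea could look roughly like the sketch below, which sums elapsed time per (prefix, action) over task-stream-like records. The record shape (a `key` plus a list of `startstops` with `action`, `start`, `stop`) is an assumption about what the plugin buffers, and `prefix_of` is a simplified, hypothetical stand-in for the real key-to-prefix logic.

```python
from collections import defaultdict


def prefix_of(key: str) -> str:
    """Simplified stand-in for prefix extraction: drop a trailing '-<token>'."""
    return key.rsplit("-", 1)[0]


def aggregate_by_prefix(records):
    """Sum elapsed seconds per (prefix, action) over task-stream-like records."""
    totals = defaultdict(float)
    for record in records:
        prefix = prefix_of(record["key"])
        for startstop in record["startstops"]:
            elapsed = startstop["stop"] - startstop["start"]
            totals[(prefix, startstop["action"])] += elapsed
    return dict(totals)


# Hypothetical buffered records; the field names are assumptions for this sketch.
records = [
    {"key": "inc-1a2b", "startstops": [
        {"action": "transfer", "start": 0.0, "stop": 0.5},
        {"action": "compute", "start": 0.5, "stop": 2.5},
    ]},
    {"key": "inc-3c4d", "startstops": [
        {"action": "compute", "start": 1.0, "stop": 2.0},
    ]},
    {"key": "add-5e6f", "startstops": [
        {"action": "disk-read", "start": 2.0, "stop": 2.25},
    ]},
]
totals = aggregate_by_prefix(records)
print(totals)
```

Three task records collapse into three (prefix, action) totals here; with many tasks per prefix the reduction in exported series would be much larger.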

mrocklin · 2022-10-05T15:12:57Z

> For the task stream (Assuming we want to expose the exact same info), I assume we could hook into the TaskStreamPlugin

I don't want us to propagate the Task Stream up to Prometheus. It's too much data and doesn't scale well. For larger clusters I've started using the TaskGroupProgress plugin, which I think gets a lot of the same information across in a more compact and scalable way.

The equivalent state is TaskPrefix.all_durations. I think that pushing that up should suffice.
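
To illustrate why cumulative per-prefix durations may be enough, here is a small sketch that turns two successive scrapes into activity rates, similar in spirit to what a dashboard query over a counter would compute. It assumes `all_durations` behaves like a mapping from action name to cumulative seconds; the flattened `(prefix, action)` keys, the sample numbers, and the function name are hypothetical.

```python
def action_rates(previous, current, interval):
    """Seconds of activity per wall-clock second for each (prefix, action)."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    rates = {}
    for series, total in current.items():
        # A series missing from the previous scrape is treated as starting at 0.
        delta = total - previous.get(series, 0.0)
        # Crude guard: a negative delta (e.g. after a restart) is clamped to 0.
        rates[series] = max(delta, 0.0) / interval
    return rates


# Hypothetical cumulative durations (seconds) at two scrapes 15 seconds apart.
scrape_t0 = {("inc", "compute"): 10.0, ("inc", "transfer"): 2.0}
scrape_t1 = {
    ("inc", "compute"): 40.0,
    ("inc", "transfer"): 5.0,
    ("add", "compute"): 15.0,
}
rates = action_rates(scrape_t0, scrape_t1, interval=15.0)
print(rates)
```

A rate of 2.0 for `("inc", "compute")` would mean that, on average, two threads were busy computing `inc` tasks during the interval, which is the kind of picture the task stream gives at a much higher data volume.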

fjetter mentioned this issue Oct 5, 2022

Count task states per task prefix and expose to Prometheus #7088

Merged

2 tasks

fjetter mentioned this issue Nov 25, 2022

Prometheus metrics improvements #7345

Open

9 tasks

fjetter added the diagnostics label Nov 25, 2022
