Export task stream metrics in prometheus #7107

Open
Tracked by #7345
fjetter opened this issue Oct 5, 2022 · 2 comments

fjetter (Member) commented Oct 5, 2022

#7088 triggered a discussion about exposing Prometheus metrics that carry enough information to replicate a task-stream-like Grafana dashboard.
An earlier attempt, which exposes different information, is shown in #7083.

Both approaches include valuable information but fall short of actually reproducing what we're looking for. This issue is to track this effort and collect ideas.

fjetter (Member, Author) commented Oct 5, 2022

Progress bar

To expose the progress bar as we have it on the Bokeh dashboard, we'd need the current number of tasks in each state by prefix, e.g.

diff --git a/distributed/http/scheduler/prometheus/core.py b/distributed/http/scheduler/prometheus/core.py
index d8a8f00c5..46e962117 100644
--- a/distributed/http/scheduler/prometheus/core.py
+++ b/distributed/http/scheduler/prometheus/core.py
@@ -54,9 +54,18 @@ class SchedulerMetricCollector(PrometheusCollector):
             labels=["task_prefix_name"],
         )

+        prefix_states = GaugeMetricFamily(
+            self.build_name("prefix_states"),
+            "Current number of tasks in a given state by prefix",
+            labels=["task_prefix_name", "state"],
+        )
+
         for tp in self.server.task_prefixes.values():
             suspicious_tasks.add_metric([tp.name], tp.suspicious)
+            for st, val in tp.states.items():
+                prefix_states.add_metric([tp.name, st], val)
         yield suspicious_tasks
+        yield prefix_states

See #7088 (comment)
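
For illustration, the scrape output such a gauge would produce can be sketched without Dask or prometheus_client. This is a stdlib-only sketch under assumptions: the `snapshot` dict stands in for `tp.states` on each task prefix, and the metric name `dask_scheduler_prefix_states` is a guess, not necessarily what `build_name` returns. It shows that every (prefix, state) pair becomes its own labelled series, which is why both label names have to be declared.

```python
def render_prefix_states(prefix_states, name="dask_scheduler_prefix_states"):
    """Render {prefix: {state: count}} in the Prometheus text exposition format.

    Label values are written as-is; escaping of quotes and backslashes is omitted.
    """
    lines = [
        f"# HELP {name} Current number of tasks in a given state by prefix",
        f"# TYPE {name} gauge",
    ]
    for prefix, states in sorted(prefix_states.items()):
        for state, count in sorted(states.items()):
            labels = f'task_prefix_name="{prefix}",state="{state}"'
            lines.append(f"{name}{{{labels}}} {count}")
    return "\n".join(lines) + "\n"


# Hypothetical snapshot standing in for the scheduler's task prefixes
snapshot = {
    "inc": {"memory": 3, "processing": 1},
    "add": {"waiting": 2},
}
exposition = render_prefix_states(snapshot)
print(exposition)
```

The number of series grows with prefixes times states, not with the number of tasks, so this should stay small even on large graphs.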

Task stream

For the task stream (assuming we want to expose the exact same info), I assume we could hook into the TaskStreamPlugin, which collects all this information and buffers it. The Prometheus metrics collection would then expose this information. It's possible to define a custom timestamp for exposed metrics: in prometheus_client, `add_metric` accepts a `timestamp` keyword argument.
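
As a rough illustration of the timestamp idea, buffered records could be turned into samples stamped with the time the work finished rather than the scrape time. This is only a sketch: the record layout (`key`, plus `startstops` entries with `action`, `start`, `stop`) is an assumption about what the plugin buffers, and the metric name is made up. In the text exposition format, the optional trailing timestamp is given in milliseconds since the epoch.

```python
def record_to_samples(record, name="dask_task_duration_seconds"):
    """Yield one exposition line per start/stop interval, stamped at its stop time."""
    for ss in record["startstops"]:
        duration = ss["stop"] - ss["start"]
        timestamp_ms = int(round(ss["stop"] * 1000))  # seconds -> milliseconds
        labels = f'key="{record["key"]}",action="{ss["action"]}"'
        yield f"{name}{{{labels}}} {duration:.3f} {timestamp_ms}"


# Hypothetical buffered record for a single task
record = {
    "key": "inc-abc123",
    "startstops": [
        {"action": "transfer", "start": 1664971200.0, "stop": 1664971200.25},
        {"action": "compute", "start": 1664971200.25, "stop": 1664971201.5},
    ],
}
samples = list(record_to_samples(record))
for line in samples:
    print(line)
```

Note that labelling by task key yields one series per task and action, which is where the data volume concern comes from.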

I'm slightly worried about the data volume but we'd need to test it. Implementation should be reasonably easy.

If data volume is an issue, we could aggregate, e.g. expose compute, disk, and network time by prefix rather than by task.
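
A minimal sketch of that aggregation, under assumptions: records are dicts with a `key` and a `startstops` list (as assumed for the task stream buffer), and `prefix_of` is a simplified, hypothetical stand-in for Dask's real key-to-prefix logic.

```python
from collections import defaultdict


def prefix_of(key):
    """Simplified stand-in for key-to-prefix logic: drop the last '-' token."""
    head, sep, _ = key.rpartition("-")
    return head if sep else key


def aggregate_by_prefix(records):
    """Sum seconds per (prefix, action) across buffered task stream records."""
    totals = defaultdict(float)
    for record in records:
        prefix = prefix_of(record["key"])
        for ss in record["startstops"]:
            totals[prefix, ss["action"]] += ss["stop"] - ss["start"]
    return dict(totals)


# Hypothetical records for three tasks across two prefixes
records = [
    {"key": "inc-1", "startstops": [
        {"action": "compute", "start": 0.0, "stop": 1.5}]},
    {"key": "inc-2", "startstops": [
        {"action": "transfer", "start": 2.0, "stop": 2.5},
        {"action": "compute", "start": 2.5, "stop": 4.0}]},
    {"key": "add-1", "startstops": [
        {"action": "compute", "start": 1.0, "stop": 1.25}]},
]
totals = aggregate_by_prefix(records)
print(totals)
```

The result has one entry per (prefix, action) pair, so the series count no longer depends on how many tasks ran.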

mrocklin (Member) commented Oct 5, 2022

> For the task stream (Assuming we want to expose the exact same info), I assume we could hook into the TaskStreamPlugin

I don't want us to propagate the Task Stream up to prometheus. It's too much data and doesn't scale well. For larger clusters I've started using the TaskGroupProgress plugin, which I think gets a lot of the same information across in a more compact and scalable way.

The equivalent state is TaskPrefix.all_durations. I think that pushing that up should suffice.
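
To see why cumulative per-prefix durations may be enough, note that the difference between two scrapes divided by the scrape interval gives the average number of threads busy on that prefix and action, which is roughly what a task stream conveys visually. A minimal sketch, assuming `all_durations` behaves like a mapping from action name to accumulated seconds (the snapshots below are invented numbers):

```python
def busy_threads(previous, current, interval_seconds):
    """Average concurrency per action between two cumulative-duration snapshots."""
    return {
        action: (total - previous.get(action, 0.0)) / interval_seconds
        for action, total in current.items()
    }


# Two hypothetical scrapes of one prefix's all_durations, 15 seconds apart
scrape_1 = {"compute": 120.0, "transfer": 4.0}
scrape_2 = {"compute": 180.0, "transfer": 5.5, "disk-read": 3.0}
rates = busy_threads(scrape_1, scrape_2, interval_seconds=15.0)
print(rates)
```

This is approximately the calculation PromQL's `rate()` performs on a counter, so a dashboard could derive it at query time from the raw totals.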
