Export task stream metrics in prometheus #7107
Comments
Progress bar

To expose the progress bar as we have it on the bokeh dashboard, we'd need current-tasks-in-state-by-prefix, e.g.

```diff
diff --git a/distributed/http/scheduler/prometheus/core.py b/distributed/http/scheduler/prometheus/core.py
index d8a8f00c5..46e962117 100644
--- a/distributed/http/scheduler/prometheus/core.py
+++ b/distributed/http/scheduler/prometheus/core.py
@@ -54,9 +54,17 @@ class SchedulerMetricCollector(PrometheusCollector):
             labels=["task_prefix_name"],
         )
+        prefix_states = GaugeMetricFamily(
+            self.build_name("prefix_states"),
+            "Current number of tasks in a given state by prefix",
+        )
+
         for tp in self.server.task_prefixes.values():
             suspicious_tasks.add_metric([tp.name], tp.suspicious)
+            for st, val in tp.states.items():
+                prefix_states.add_metric([tp.name, st], val)
         yield suspicious_tasks
+        yield prefix_states
```

See #7088 (comment)

Task stream

For the task stream (assuming we want to expose the exact same info), I assume we could hook into whatever already collects and buffers all this information. The Prometheus metrics collection would then expose this information (it's possible to define a custom timestamp for exposed metrics; there is a kwarg for it). I'm slightly worried about the data volume, but we'd need to test it. Implementation should be reasonably easy. If data volume is an issue, we could perform some aggregation, e.g. exposing compute, disk, and network by prefix rather than by task.
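To make the aggregation idea concrete, here is a minimal sketch of collapsing buffered task-stream records into per-prefix totals before anything is exported. The record layout (a `key` plus a list of `startstops` entries with `action`, `start`, and `stop`) is an assumption made for illustration, and `prefix_of` and `aggregate_by_prefix` are hypothetical helpers, not part of distributed.

```python
from collections import defaultdict


def prefix_of(key: str) -> str:
    # Hypothetical simplification: treat everything before the first "-"
    # as the task prefix. Real keys may need more careful splitting.
    return key.split("-", 1)[0]


def aggregate_by_prefix(records):
    """Sum the seconds spent per (prefix, action) over buffered records."""
    totals = defaultdict(float)
    for rec in records:
        prefix = prefix_of(rec["key"])
        for ss in rec["startstops"]:
            totals[(prefix, ss["action"])] += ss["stop"] - ss["start"]
    return dict(totals)


# Assumed record shape, for illustration only.
records = [
    {
        "key": "inc-1",
        "startstops": [
            {"action": "transfer", "start": 0.0, "stop": 0.5},
            {"action": "compute", "start": 0.5, "stop": 2.5},
        ],
    },
    {
        "key": "inc-2",
        "startstops": [{"action": "compute", "start": 1.0, "stop": 2.0}],
    },
    {
        "key": "add-1",
        "startstops": [{"action": "compute", "start": 2.0, "stop": 2.25}],
    },
]

totals = aggregate_by_prefix(records)
```

With this shape, the number of exported series is bounded by the number of prefixes times the number of actions, not by the number of tasks. If something like this were exported as a labelled metric family, the family would presumably need to declare one label name per label value passed to `add_metric`.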
I don't want us to propagate the Task Stream up to Prometheus. It's too much data and doesn't scale well. For larger clusters I've started using the TaskGroupProgress plugin, which I think gets a lot of the same information across in a more compact and scalable way. The equivalent state is
#7088 triggered a discussion about exposing Prometheus metrics that include the information needed to replicate a task-stream-like Grafana dashboard.
An earlier attempt exposing different information is shown in #7083.
Both approaches include valuable information but fall short of actually reproducing what we're looking for. This issue is to track this effort and collect ideas.