Count task states per task prefix and expose to Prometheus #7088

Merged: 5 commits merged into dask:main on Oct 5, 2022

Conversation

@ntabris (Contributor) commented Sep 29, 2022

Keeps an accumulated count, on each task prefix, of the number of tasks that have been in each state, and exposes this on the Prometheus endpoint.

Here's an example plot of this data in Grafana:

[Screenshot: Grafana plot of the per-prefix task state counts]

(this is a plot of increase(dask_scheduler_prefix_state_totals_total{cluster="$cluster",state="memory"}[10s]))
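
For illustration only, here is a minimal sketch of how a counter like this can be exposed from a custom Prometheus collector. This is not the code in this PR: the collect_prefix_state_totals helper, the task_prefixes mapping, and the state_counts attribute are assumptions standing in for the scheduler-side state.

from prometheus_client.core import CounterMetricFamily

def collect_prefix_state_totals(task_prefixes):
    # task_prefixes: assumed mapping of prefix name -> object carrying a
    # state_counts dict of {state: cumulative number of tasks seen in that state}
    family = CounterMetricFamily(
        "dask_scheduler_prefix_state_totals",
        "Accumulated count of task states per task prefix",
        labels=["task_prefix_name", "state"],
    )
    for name, tp in task_prefixes.items():
        for state, count in tp.state_counts.items():
            family.add_metric([name, state], count)
    yield family

On the /metrics endpoint, prometheus_client appends a _total suffix to counter names, which is why the query above refers to dask_scheduler_prefix_state_totals_total.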

  • Tests added / passed
  • Passes pre-commit run --all-files

@GPUtester (Collaborator)

Can one of the admins verify this patch?

Admins can comment "ok to test" to allow this one PR to run, or "add to allowlist" to allow all future PRs from the same author to run.

@mrocklin (Member)

Aside from the "foo" comment this all seems pretty clean to me. I recommend adding a screenshot of the prometheus chart that this enables for motivation.

cc'ing @fjetter for when he gets back. Maybe also @crusaderky in the meantime?

@mrocklin (Member)

Supersedes #7083

@mrocklin (Member)

cc also @jacobtomlinson, although I suspect he won't have much to say about the implementation.

@ntabris marked this pull request as ready for review September 30, 2022 14:23
@github-actions bot (Contributor) commented Sep 30, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files (±0), 15 suites (±0), 6h 37m 42s duration (+14m 6s)
3 155 tests (+2): 3 069 passed (±0), 85 skipped (+1), 1 failed (+1)
23 350 runs (+15): 22 441 passed (+11), 908 skipped (+3), 1 failed (+1)

For more details on these failures, see this check.

Results for commit 299ff1d. Comparison against base commit 044225a.

♻️ This comment has been updated with latest results.

@quasiben (Member)

add to allowlist

@ntabris (Contributor, Author) commented Sep 30, 2022

The test failure that's specifically on py38 on Windows doesn't appear to be related to anything I changed. Can someone help with this?

@fjetter (Member) left a comment

I'm OK with the changes, but I am wondering if totals are actually the most useful counter here. Wouldn't the current value be more useful?

@ntabris (Contributor, Author) commented Oct 3, 2022

I am wondering if totals are actually the most useful counter here. Wouldn't the current value be more useful?

If you collect current values, you'll lose what happens between the times that you sample (i.e., scrape the Prometheus endpoint).

With a counter, it's easy to get the number of tasks that transitioned over the entire sample interval: plot the difference between the counts. That seems more useful if you're sampling at an interval that's large relative to the duration of task transitions, and it approximates the current value more and more closely as your sampling (i.e., scraping) interval gets smaller.

@fjetter (Member) commented Oct 3, 2022

With a counter, it's easy to get the number of tasks that transitioned over the entire sample interval: plot the difference between the counts. That seems more useful if you're sampling at an interval that's large relative to the duration of task transitions, and it approximates the current value more and more closely as your sampling (i.e., scraping) interval gets smaller.

That's true, but I'm typically interested in the current value at a given point in time (i.e., more like a gauge in the Prometheus sense). To make the graph you're showing, you need to calculate a difference between states, don't you? What does the formula you're plotting look like?

@ntabris (Contributor, Author) commented Oct 3, 2022

That's true, but I'm typically interested in the current value at a given point in time (i.e., more like a gauge in the Prometheus sense). To make the graph you're showing, you need to calculate a difference between states, don't you? What does the formula you're plotting look like?

Something like increase(dask_scheduler_prefix_state_totals_total{cluster="$cluster",state="processing"}[10s]).

@fjetter (Member) commented Oct 3, 2022

That's true, but I'm typically interested in the current value at a given point in time (i.e., more like a gauge in the Prometheus sense). To make the graph you're showing, you need to calculate a difference between states, don't you? What does the formula you're plotting look like?

Something like increase(dask_scheduler_prefix_state_totals_total{cluster="$cluster",state="processing"}[10s]).

So the plot doesn't show tasks in state processing but rather tasks that transitioned to processing in the last 10s. IIUC the following would spike once and then be flat for the entire duration of the computation. Is this what you intend to show?

import time

from distributed import Client

def do_stuff(x, delay):
    time.sleep(delay)

client = Client()  # assumes a running client/cluster
futures = client.map(do_stuff, range(10), delay=60)  # 10 tasks, each sleeping 60s

@mrocklin (Member) commented Oct 3, 2022

I think that we actually want to show the derivative of tasks that have transitioned into the memory state. This is closer to what the task stream represents today.
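
For example (assuming the metric name and labels shown in the PR description), a query like rate(dask_scheduler_prefix_state_totals_total{state="memory"}[1m]) would plot tasks completed per second for each prefix.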

@ntabris (Contributor, Author) commented Oct 3, 2022

I think that we actually want to show the derivative of tasks that have transitioned into the memory state. This is closer to what the task stream represents today.

This is easy to get with the metrics exposed here.

So the plot doesn't show tasks in state processing but rather tasks that transitioned to processing in the last 10s. IIUC the following would spike once and then be flat for the entire duration of the computation. Is this what you intend to show?

Pretty much. (I think it would actually be transitions in the last 5s, since with a 5s sample interval I think this gives you the difference of two samples that are 5s apart, but I'm not highly confident about this.)

Stepping back, there are three related numbers:

  1. How fast tasks (for a given prefix) are finishing (transitioning into memory state)
  2. How many tasks (for a given prefix) are being processed
  3. How long the tasks take to process

This PR exposes (1). @fjetter, I think you're asking for (2), and noting that (1) and (2) come apart more when the time to complete individual tasks is relatively long. Is that right?

(1) and (2) are also interestingly different when you're comparing tasks that take significantly different lengths of time to complete... for example, if you see that task prefix A is completing tasks much faster than task prefix B, you might want to know whether this is because B is waiting on A, or because A is just a fast thing to do.

My sense here is that (1) is valuable, and (2) or (3) or both would also be valuable, but that having (2) wouldn't remove the value of also having (1). Is that wrong?

@mrocklin (Member) commented Oct 4, 2022

(2) was done, I think, in #7083

It was good! But I think that this is better. Sampling at 5s felt lossy. This feels like it gives the same information, and without loss. The reason for (2) over (1) I think is that it doesn't require any additional state in the scheduler, but the state here seems fairly innocuous.

@fjetter (Member) commented Oct 4, 2022

How fast tasks (for a given prefix) are finishing (transitioning into memory state)

Yes, I see value in this derivative. I've previously been using the scheduler_tasks gauge to infer progress on a computation, and the slope was often a very important diagnostic tool. Thinking about it a bit longer, I'm actually pretty excited about the prospect of having this by task prefix.
I think I was thrown off by the graph title, which I would choose differently.


What I was thinking about is more along the lines of the scheduler_tasks gauge, but split up by prefixes.

An earlier attempt to build visualizations on task groups is, for instance, #5320. That PR used durations, which is a much harder problem since we only get duration measurements for historic data. I'm not sure if we can publish Prometheus metrics with a custom timestamp, can we?

A similar but simpler visualization could be built if we simply exposed the current state as a gauge and created such a stacked chart without durations (not as powerful, but a good first step).

I actually think it is best to just add both and see how we can make use of them. The only problem I see is that we do not have an easy way to manage deprecations with these metrics, but I'm OK with that for now.

Explicitly, I am proposing to add:

diff --git a/distributed/http/scheduler/prometheus/core.py b/distributed/http/scheduler/prometheus/core.py
index d8a8f00c5..46e962117 100644
--- a/distributed/http/scheduler/prometheus/core.py
+++ b/distributed/http/scheduler/prometheus/core.py
@@ -54,9 +54,18 @@ class SchedulerMetricCollector(PrometheusCollector):
             labels=["task_prefix_name"],
         )

+        prefix_states = GaugeMetricFamily(
+            self.build_name("prefix_states"),
+            "Current number of tasks in a given state by prefix",
+            labels=["task_prefix_name", "state"],
+        )
+
         for tp in self.server.task_prefixes.values():
             suspicious_tasks.add_metric([tp.name], tp.suspicious)
+            for st, val in tp.states.items():
+                prefix_states.add_metric([tp.name, st], val)
         yield suspicious_tasks
+        yield prefix_states

With this we could actually also "deprecate" the tasks gauge, since prefix_states includes the same info (after an aggregation).
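
(For instance, assuming the gauge is exposed as dask_scheduler_prefix_states with a state label, sum by (state) (dask_scheduler_prefix_states) would reproduce the per-state task totals that the tasks gauge reports today.)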

I suggest doing this in a separate PR.

@ntabris (Contributor, Author) commented Oct 4, 2022

I'm actually pretty excited about the prospect of having this by task prefix.

Yay! I've also updated the example plot in the PR description.

I'm not sure if we can publish Prometheus metrics with a custom timestamp, can we?

I'm not currently aware of a good way to do something like that, but I'm thinking about this.

A similar but simpler visualization could be built if we simply exposed the current state as a gauge and created such a stacked chart without durations (not as powerful, but a good first step). [...] I suggest doing this in a separate PR.

Sounds good, agree that it makes sense to do this separately.

@jrbourbeau (Member) left a comment

Just noting that the distributed/dashboard/tests/test_scheduler_bokeh.py::test_simple failure is unrelated to the changes here and has been fixed in #7096

@jrbourbeau (Member) left a comment

Just checking in -- it looks like all outstanding review comments have been addressed (there's follow-up work to do, just not in this PR). @fjetter @mrocklin, mind giving this a final review?

@mrocklin (Member) commented Oct 4, 2022

I've gone through things again and everything seems fine to me. I'm +1 on merging.

However, after reading through comments and thinking a bit more about this, I realize that we are losing duration information, which I didn't notice in the first pass. We had this implicitly when we were sampling; I didn't think about how dropping sampling also causes us to lose an implicit weighting toward longer tasks.

I expect we'll want something slightly different if we're looking to recreate the task group progress chart (which I do think we want). Looking at the code, I think we may already have this state in TaskPrefix.all_durations["compute"]. I suspect that if we were to publish that as a counter, it would give us the chart that we want.

(if so, hopefully @ntabris doesn't mind the previous exercise as a good learning experience)

I think that there are now two options:

  1. If @ntabris is ready to be done then we merge this and handle the rest in follow-up work. I'm +1 to hit the green button.
  2. If @ntabris wants to do a bit more work then we also publish all_durations["compute"] and focus on that for near-term charting.

@mrocklin (Member) commented Oct 4, 2022

then we also publish all_durations["compute"] and focus on that for near-term charting

If metrics are cheap then maybe we publish all of all_durations, similar to how we do now with states. This would give us transfer, serialization, and disk costs as Dask sees them. This is maybe a bit redundant with actual hardware metrics, so it's of only modest value.
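
As a rough illustration of what that might look like, here is a sketch mirroring the per-state counter above (the metric name, label names, and the assumption that all_durations maps activity names to accumulated seconds are mine, not code from this PR):

from prometheus_client.core import CounterMetricFamily

def collect_prefix_duration_totals(task_prefixes):
    # all_durations is assumed to map an activity name ("compute",
    # "transfer", ...) to accumulated seconds for that prefix
    family = CounterMetricFamily(
        "dask_scheduler_prefix_duration_seconds",
        "Accumulated task durations per task prefix and activity",
        labels=["task_prefix_name", "activity"],
    )
    for name, tp in task_prefixes.items():
        for activity, seconds in tp.all_durations.items():
            family.add_metric([name, activity], seconds)
    yield family

A rate() over the compute series would then give seconds of compute per wall-clock second for each prefix, which roughly recovers the weighting toward longer tasks mentioned above.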

@fjetter (Member) commented Oct 5, 2022

I still think this is worth having. The rate of tasks in a given state is still interesting information. I could see the average or max rate being a valuable metric for comparing performance-related changes, and it is definitely a cheap metric to have.

I opened #7107 for a follow-up conversation about the task stream.

@fjetter (Member) left a comment

Thanks @ntabris, and congratulations on your first contribution!

@fjetter merged commit 61353b7 into dask:main on Oct 5, 2022
gjoseph92 pushed a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022