Use metering for P2P shuffling instrumentation #7943

Closed
Tracked by #8043
fjetter opened this issue Jun 22, 2023 · 1 comment · Fixed by #8364, #8365, #8366 or #8367

fjetter commented Jun 22, 2023

The P2P extensions (primarily the buffers) implement their own diagnostics timing, e.g. here:

@contextlib.contextmanager
def time(self, name: str) -> Iterator[None]:
    # Accumulate the wall-clock time spent in the wrapped block into the
    # buffer's own self.diagnostics dict under the given name.
    start = time()
    yield
    stop = time()
    self.diagnostics[name] += stop - start

I think P2P diagnostics would benefit greatly from the metering / fine-grained metrics, and we should consider replacing (or extending) the custom context managers there with metrics.meter. Mind that there is also a dedicated dashboard for P2P shuffling that uses these custom diagnostics; I don't want to break that dashboard.
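For illustration only, here is a minimal sketch of the "replace or extend" idea, assuming the context_meter API from distributed.metrics; the ShardsBuffer stand-in and the meter labels are placeholders, and the merged PRs listed below are the authoritative implementation. The buffer-local counter is kept so the dedicated P2P dashboard continues to work:

import contextlib
from collections import defaultdict
from collections.abc import Iterator

from distributed.metrics import context_meter, time


class ShardsBuffer:  # stand-in for the real P2P buffer classes
    def __init__(self) -> None:
        self.diagnostics: defaultdict[str, float] = defaultdict(float)

    @contextlib.contextmanager
    def time(self, name: str) -> Iterator[None]:
        # Record the elapsed time as a fine performance metric via the shared
        # context meter *and* keep feeding the buffer-local diagnostics dict
        # that the P2P shuffling dashboard reads.
        start = time()
        with context_meter.meter(name):
            yield
        stop = time()
        self.diagnostics[name] += stop - start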

cc @crusaderky @hendrikmakait

fjetter added the enhancement and diagnostics labels and removed the needs triage label on Jun 27, 2023
fjetter replaced the good first issue label with the good second issue label on Jul 5, 2023
crusaderky self-assigned this on Nov 13, 2023

crusaderky commented Nov 19, 2023

I'm delivering this ticket in 4 incremental PRs for ease of review.
Each PR incorporates the previous ones. While they make sense on their own, I would encourage reviewing them holistically.

  • Shuffle metrics 1/4: Add foreground metrics #8364
    This adds Fine Performance Metrics to everything except the background tasks. Everything is logged under 'execute', as is normal for tasks, and is visible from both the Bokeh dashboard and the Coiled dashboard.
    This already shows interesting data - e.g. how long it takes to break each partition into shards and serialize them. Unsurprisingly, you will observe that most of the time is spent waiting in the comms ResourceLimiter.wait_for_available method - in other words, waiting for the background task to free up.
  • Shuffle metrics 2/4: Add background metrics #8365
    This adds Fine Performance Metrics to the background tasks, but they end up lost in the void as there are no callbacks capturing them.
  • Shuffle metrics 3/4: Capture background metrics #8366
    This captures the Fine Performance Metrics for the background tasks and attributes them to spans. They are stored under a new metrics context tag (the first element of the metrics key tuple), shuffle, which sits alongside execute, gather-dep, get-data and memory-monitor.

Additionally, this PR duplicates all foreground metrics so that they appear both under shuffle and under execute, to facilitate performance analysis. Metrics are separated into sub-sections:

  • (shuffle, foreground, ...)
  • (shuffle, background-comms, ...)
  • (shuffle, background-disk, ...)

A noteworthy difference is that the (shuffle, foreground, ...) metrics always retain the disaggregated activities, whereas the corresponding execute metrics are collapsed into cancelled or failed activities when a task is cancelled or fails - cancellation happens quite frequently and is something that should be investigated.
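To make the capture step concrete, here is a hedged sketch (not the merged code) of how a background task's metrics could be picked up with a callback on distributed.metrics.context_meter and re-keyed under the new shuffle context; the callback signature, the accumulator and the activity names are assumptions for illustration:

from collections import defaultdict

from distributed.metrics import context_meter

# Hypothetical stand-in for the worker-side cumulative metrics store.
cumulative_metrics: defaultdict[tuple, float] = defaultdict(float)


def _capture(label, value, unit):
    # ContextMeter callbacks receive (label, value, unit); re-key everything
    # under the "shuffle" context so it lands alongside execute, gather-dep,
    # get-data and memory-monitor instead of being lost.
    if not isinstance(label, tuple):
        label = (label,)
    cumulative_metrics[("shuffle", "background-disk", *label, unit)] += value


def run_background_disk_work() -> None:
    # Everything metered inside this block is delivered to _capture.
    with context_meter.add_callback(_capture):
        with context_meter.meter("disk-write"):
            pass  # write shards to disk (placeholder)

Under these assumptions, the "disk-write" activity above would end up under a key like ("shuffle", "background-disk", "disk-write", "seconds"), matching the sub-sections listed above.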

An A/B test shows no performance impact.

This introduces a minor regression in the Coiled dashboard (screenshot omitted).

Out of scope

  • Visualize the background metrics on the Bokeh dashboard
  • Visualize the background metrics on the Coiled dashboard (IMHO we shouldn't; this is a nitty-gritty internal thing)
  • Extract insights from coiled/benchmarks (everything should be recorded in the Coiled database; cluster IDs are available in benchmarks.db via https://github.com/coiled/benchmarks/actions/runs/6917401769)
