Use metering for P2P shuffling instrumentation #7943

Closed
Tracked by #8043
fjetter opened this issue Jun 22, 2023 · 1 comment · Fixed by #8364, #8365, #8366 or #8367

fjetter commented Jun 22, 2023

The P2P extensions (primarily the buffers) implement their own diagnostics timing, e.g. here:

@contextlib.contextmanager
def time(self, name: str) -> Iterator[None]:
    # Accumulate the wall-clock time spent in the wrapped block into the
    # buffer's own self.diagnostics dict under the given name.
    start = time()
    yield
    stop = time()
    self.diagnostics[name] += stop - start

I think P2P diagnostics would benefit greatly from the metering / fine-grained metrics, and we should consider replacing (or extending) the custom context managers there with metrics.meter. Mind that there is also a dedicated dashboard for P2P shuffling that uses these custom diagnostics; I don't want to break that dashboard.
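For illustration only, here is a minimal sketch of the "replace or extend" idea, assuming the context_meter API from distributed.metrics; the ShardsBuffer stand-in and the meter labels are placeholders, and the merged PRs listed below are the authoritative implementation. The buffer-local counter is kept so the dedicated P2P dashboard continues to work:

import contextlib
from collections import defaultdict
from collections.abc import Iterator

from distributed.metrics import context_meter, time


class ShardsBuffer:  # stand-in for the real P2P buffer classes
    def __init__(self) -> None:
        self.diagnostics: defaultdict[str, float] = defaultdict(float)

    @contextlib.contextmanager
    def time(self, name: str) -> Iterator[None]:
        # Record the elapsed time as a fine performance metric via the shared
        # context meter *and* keep feeding the buffer-local diagnostics dict
        # that the P2P shuffling dashboard reads.
        start = time()
        with context_meter.meter(name):
            yield
        stop = time()
        self.diagnostics[name] += stop - start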

cc @crusaderky @hendrikmakait

fjetter added the enhancement and diagnostics labels and removed the needs triage label on Jun 27, 2023
fjetter replaced the good first issue label with the good second issue label on Jul 5, 2023
crusaderky self-assigned this on Nov 13, 2023

crusaderky commented Nov 19, 2023

I'm delivering this ticket in 4 incremental PRs for ease of review.
Each PR incorporates the previous ones. While they make sense on their own, I would encourage reviewing them holistically.

  • Shuffle metrics 1/4: Add foreground metrics #8364
    This adds Fine Performance Metrics to everything except the background tasks. Everything is logged under 'execute', as is normal for tasks, and is visible from both the Bokeh dashboard and the Coiled dashboard.
    This already shows interesting data - e.g. how long it takes to break each partition into shards and serialize them. Unsurprisingly, you will observe that most of the time is spent waiting in the comms ResourceLimiter.wait_for_available method - in other words, waiting for the background task to free up.
  • Shuffle metrics 2/4: Add background metrics #8365
    This adds Fine Performance Metrics to the background tasks, but they end up lost in the void as there are no callbacks capturing them.
  • Shuffle metrics 3/4: Capture background metrics #8366
    This captures the Fine Performance Metrics for the background tasks and attributes them to spans. They are stored under a new metrics context tag (the first element of the metrics key tuple), shuffle, which sits alongside execute, gather-dep, get-data and memory-monitor.

Additionally, this PR duplicates all foreground metrics so that they appear both under shuffle and under execute, to facilitate performance analysis. Metrics are separated into sub-sections:

  • (shuffle, foreground, ...)
  • (shuffle, background-comms, ...)
  • (shuffle, background-disk, ...)

A noteworthy difference is that the (shuffle, foreground, ...) metrics always retain the disaggregated activities, whereas the corresponding execute metrics are collapsed into cancelled or failed activities when a task is cancelled or fails - cancellation happens quite frequently and is something that should be investigated.
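To make the capture step concrete, here is a hedged sketch (not the merged code) of how a background task's metrics could be picked up with a callback on distributed.metrics.context_meter and re-keyed under the new shuffle context; the callback signature, the accumulator and the activity names are assumptions for illustration:

from collections import defaultdict

from distributed.metrics import context_meter

# Hypothetical stand-in for the worker-side cumulative metrics store.
cumulative_metrics: defaultdict[tuple, float] = defaultdict(float)


def _capture(label, value, unit):
    # ContextMeter callbacks receive (label, value, unit); re-key everything
    # under the "shuffle" context so it lands alongside execute, gather-dep,
    # get-data and memory-monitor instead of being lost.
    if not isinstance(label, tuple):
        label = (label,)
    cumulative_metrics[("shuffle", "background-disk", *label, unit)] += value


def run_background_disk_work() -> None:
    # Everything metered inside this block is delivered to _capture.
    with context_meter.add_callback(_capture):
        with context_meter.meter("disk-write"):
            pass  # write shards to disk (placeholder)

Under these assumptions, the "disk-write" activity above would end up under a key like ("shuffle", "background-disk", "disk-write", "seconds"), matching the sub-sections listed above.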

An A/B test shows no performance impact.

This introduces a minor regression in the Coiled dashboard (screenshot omitted).

Out of scope

  • Visualize the background metrics on the Bokeh dashboard
  • Visualize the background metrics on the Coiled dashboard (IMHO we shouldn't; this is a nitty-gritty internal thing)
  • Extract insights from coiled/benchmarks (everything should be recorded in the Coiled database; cluster IDs are available in benchmarks.db via https://github.com/coiled/benchmarks/actions/runs/6917401769)
