
[FEA] Qualification tool should look at spill metrics #477

Closed
2 tasks done
tgravescs opened this issue Aug 4, 2023 · 3 comments · Fixed by #1002
Assignees
Labels
core_tools Scope the core module (scala) feature request New feature or request

Comments

@tgravescs (Collaborator)

tgravescs commented Aug 4, 2023

Is your feature request related to a problem? Please describe.
Spill can be very time consuming, so the qualification tool should take spill metrics into account.

Note, originally mentioned in #73

Tasks

  1. bug (core_tools): amahussein
  2. feature request (user_tools): parthosa
@tgravescs tgravescs added feature request New feature or request ? - Needs Triage labels Aug 4, 2023
@mattahrens mattahrens added core_tools Scope the core module (scala) and removed ? - Needs Triage labels Aug 8, 2023
@viadea (Collaborator)

viadea commented Oct 6, 2023

To add some context, I would like the Qualification tool to consider both of the following factors:

  1. The "Spilling to disk", "Spilling to memory", "Shuffle read", and "Shuffle write" metrics
  2. Cluster shape to figure out the local disk bandwidth per executor

For example, a Standard_NC8as_T4_v3 node has 240 MB/s of combined read/write bandwidth for each local disk per Spark executor,
while a Dataproc node with 2 x T4s and 8 local NVMe disks per node has 1.4 GB/s of local-disk write throughput per Spark executor.

If the cluster shape is provided as input, the Qualification tool should be able to estimate the negative performance impact based on the two factors above.
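To illustrate the idea, here is a minimal sketch of such an estimate. The cluster-shape keys, the bandwidth table, and the function name are all hypothetical, not part of the tool; the bandwidth figures come from the examples above, and the 2x factor assumes spilled bytes are written once and read back once.

```python
from typing import Optional

# Local-disk bandwidth per executor in bytes/sec, keyed by a hypothetical
# cluster-shape identifier (values taken from the comment above).
DISK_BW_PER_EXECUTOR = {
    "Standard_NC8as_T4_v3": 240 * 1024 ** 2,      # ~240 MB/s combined read/write
    "dataproc-2xT4-8xNVMe": int(1.4 * 1024 ** 3), # ~1.4 GB/s write
}

def estimated_spill_seconds(spilled_bytes: int, cluster_shape: str) -> Optional[float]:
    """Estimate wall-clock seconds lost to spill for one executor.

    Spilled bytes are written to local disk and later read back, so the
    sketch charges 2x the spilled volume against the disk bandwidth.
    """
    bw = DISK_BW_PER_EXECUTOR.get(cluster_shape)
    if bw is None:
        return None  # unknown cluster shape: no estimate possible
    return 2.0 * spilled_bytes / bw
```

A real implementation would also need per-stage parallelism and overlap with compute, which this sketch ignores.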

@tgravescs (Collaborator, Author)

The problem is that a CPU spill does not mean the GPU is going to spill, and vice versa: the CPU might not spill while the GPU does. Many factors feed into this, which makes it a very hard problem. Maybe we can use a heuristic such as: if spill on the CPU is over a certain threshold, the GPU will likely spill too. Or we could look at other characteristics/metrics of the job together with the cluster, but that definitely gets complex.
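The threshold heuristic mentioned above could be sketched as follows. The function name, the choice of shuffle bytes as the denominator, and the 10% default are all illustrative assumptions, not anything the tool implements.

```python
def likely_gpu_spill(cpu_spill_bytes: int, shuffle_bytes: int,
                     threshold_ratio: float = 0.1) -> bool:
    """Hypothetical heuristic: flag a stage as likely to spill on the GPU
    only when CPU spill exceeds the given fraction of its shuffle traffic.
    """
    # A stage with no shuffle traffic gives no signal under this heuristic.
    return shuffle_bytes > 0 and cpu_spill_bytes / shuffle_bytes > threshold_ratio
```

Normalizing by shuffle volume (rather than using an absolute byte threshold) is one way to make the heuristic comparable across stages of very different sizes.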

@amahussein amahussein assigned amahussein and unassigned nartal1 Apr 18, 2024
@amahussein (Collaborator)

We discussed offline the initial steps to tackle this issue:

  • Make the qualification tool capture spill metrics per stage
  • Identify the stages that exhibit spills
  • Flag any stage that spills but does not contain a join or aggregate operator
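The steps above can be sketched like this. The `StageSpill` class, the operator-name set, and the field names are hypothetical stand-ins for the tool's real event-log model.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StageSpill:
    stage_id: int
    spilled_disk_bytes: int
    spilled_memory_bytes: int
    exec_nodes: List[str]  # SQL plan operators that ran in this stage

# Illustrative set of join/aggregate operators; the real plan-node names
# would come from the parsed Spark SQL plan.
JOIN_AGG_OPS = {"SortMergeJoin", "BroadcastHashJoin",
                "HashAggregate", "SortAggregate"}

def total_spill(s: StageSpill) -> int:
    return s.spilled_disk_bytes + s.spilled_memory_bytes

def flagged_stages(stages: List[StageSpill]) -> List[int]:
    # Step 2: keep only stages that exhibit spills.
    # Step 3: flag those whose plan has no join/aggregate operator.
    return [s.stage_id for s in stages
            if total_spill(s) > 0 and not (set(s.exec_nodes) & JOIN_AGG_OPS)]
```

The intent of the last step is that spills in stages without a join or aggregate are unexpected and worth surfacing to the user.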

amahussein added a commit to amahussein/spark-rapids-tools that referenced this issue Apr 29, 2024
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Contributes to NVIDIA#477

This code change aims at bringing the Q/P tools handling of stages and
their accumulator to a common ground.
This code change includes a couple of fixes:
- Capture accumulator IDs of a stage during the stage-completion event.
- Fix the construction of MLFunctions.
- Fix the implementation of `jobAndStageMetricsAggregation`, which
  inefficiently iterated over the task list multiple times.
- Remove a redundant data structure that maps accumulators to stages.
amahussein added a commit that referenced this issue Apr 30, 2024
* Refactor Stage info code between Q/P tools

Contributes to #477

This code change aims at bringing the Q/P tools handling of stages and
their accumulator to a common ground.
This code change includes a couple of fixes:
- Capture accumulator IDs of a stage during the stage-completion event.
- Fix the construction of MLFunctions.
- Fix the implementation of `jobAndStageMetricsAggregation`, which
  inefficiently iterated over the task list multiple times.
- Remove a redundant data structure that maps accumulators to stages.

* Remove unused class StageInfoClass
* Move StageModelManager to a separate scala class file

---------

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
@parthosa parthosa self-assigned this May 9, 2024