
[FEA] Qualification tool should look at spill metrics #477

Closed
2 tasks done
tgravescs opened this issue Aug 4, 2023 · 3 comments · Fixed by #1002
Assignees
Labels
core_tools Scope the core module (scala) feature request New feature or request

Comments

@tgravescs (Collaborator)

tgravescs commented Aug 4, 2023

Is your feature request related to a problem? Please describe.
Spill can be very time consuming, so the qualification tool should take spill metrics into account.

Note, originally mentioned in #73

Tasks

  1. bug (core_tools): amahussein
  2. feature request (user_tools): parthosa
@tgravescs tgravescs added feature request New feature or request ? - Needs Triage labels Aug 4, 2023
@mattahrens mattahrens added core_tools Scope the core module (scala) and removed ? - Needs Triage labels Aug 8, 2023
@viadea (Collaborator)

viadea commented Oct 6, 2023

To add some context, I would like the Qualification tool to consider both of the following factors:

  1. The "Spilling to disk", "Spilling to memory", "Shuffle read", and "Shuffle write" metrics
  2. Cluster shape to figure out the local disk bandwidth per executor

For example, a Standard_NC8as_T4_v3 node has 240 MB/s of combined read/write bandwidth for each local disk per Spark executor,
while a Dataproc node with 2 x T4s and 8 local NVMe disks per node has 1.4 GB/s of local-disk write throughput per Spark executor.

If the cluster shape is provided as input, the Qualification tool should be able to estimate the negative performance impact based on the two factors above.
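To illustrate the idea, here is a minimal sketch of such an estimate. The cluster-shape keys, the bandwidth table, and the function name are all hypothetical, not part of the tool; the bandwidth figures come from the examples above, and the 2x factor assumes spilled bytes are written once and read back once.

```python
from typing import Optional

# Local-disk bandwidth per executor in bytes/sec, keyed by a hypothetical
# cluster-shape identifier (values taken from the comment above).
DISK_BW_PER_EXECUTOR = {
    "Standard_NC8as_T4_v3": 240 * 1024 ** 2,      # ~240 MB/s combined read/write
    "dataproc-2xT4-8xNVMe": int(1.4 * 1024 ** 3), # ~1.4 GB/s write
}

def estimated_spill_seconds(spilled_bytes: int, cluster_shape: str) -> Optional[float]:
    """Estimate wall-clock seconds lost to spill for one executor.

    Spilled bytes are written to local disk and later read back, so the
    sketch charges 2x the spilled volume against the disk bandwidth.
    """
    bw = DISK_BW_PER_EXECUTOR.get(cluster_shape)
    if bw is None:
        return None  # unknown cluster shape: no estimate possible
    return 2.0 * spilled_bytes / bw
```

A real implementation would also need per-stage parallelism and overlap with compute, which this sketch ignores.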

@tgravescs (Collaborator, Author)

The problem is that a CPU spill does not mean the GPU is going to spill, and vice versa: the CPU might not spill while the GPU does. Many factors feed into this, which makes it a very hard problem. Maybe we can use a heuristic such as: if spill on the CPU is over a certain threshold, the GPU will likely spill too. Or we could look at other characteristics/metrics of the job together with the cluster, but that definitely gets complex.
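The threshold heuristic mentioned above could be sketched as follows. The function name, the choice of shuffle bytes as the denominator, and the 10% default are all illustrative assumptions, not anything the tool implements.

```python
def likely_gpu_spill(cpu_spill_bytes: int, shuffle_bytes: int,
                     threshold_ratio: float = 0.1) -> bool:
    """Hypothetical heuristic: flag a stage as likely to spill on the GPU
    only when CPU spill exceeds the given fraction of its shuffle traffic.
    """
    # A stage with no shuffle traffic gives no signal under this heuristic.
    return shuffle_bytes > 0 and cpu_spill_bytes / shuffle_bytes > threshold_ratio
```

Normalizing by shuffle volume (rather than using an absolute byte threshold) is one way to make the heuristic comparable across stages of very different sizes.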

@amahussein amahussein assigned amahussein and unassigned nartal1 Apr 18, 2024
@amahussein (Collaborator)

We discussed offline the initial steps to tackle this issue:

  • Make the qualification tool capture spill metrics per stage
  • Identify the stages that exhibit spills
  • Flag any stage that spills but does not contain a join or aggregate operator
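The steps above can be sketched like this. The `StageSpill` class, the operator-name set, and the field names are hypothetical stand-ins for the tool's real event-log model.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StageSpill:
    stage_id: int
    spilled_disk_bytes: int
    spilled_memory_bytes: int
    exec_nodes: List[str]  # SQL plan operators that ran in this stage

# Illustrative set of join/aggregate operators; the real plan-node names
# would come from the parsed Spark SQL plan.
JOIN_AGG_OPS = {"SortMergeJoin", "BroadcastHashJoin",
                "HashAggregate", "SortAggregate"}

def total_spill(s: StageSpill) -> int:
    return s.spilled_disk_bytes + s.spilled_memory_bytes

def flagged_stages(stages: List[StageSpill]) -> List[int]:
    # Step 2: keep only stages that exhibit spills.
    # Step 3: flag those whose plan has no join/aggregate operator.
    return [s.stage_id for s in stages
            if total_spill(s) > 0 and not (set(s.exec_nodes) & JOIN_AGG_OPS)]
```

The intent of the last step is that spills in stages without a join or aggregate are unexpected and worth surfacing to the user.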

amahussein added a commit to amahussein/spark-rapids-tools that referenced this issue Apr 29, 2024
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Contributes to NVIDIA#477

This code change aims at bringing the Q/P tools handling of stages and
their accumulator to a common ground.
This code change includes a couple of fixes:
- Capture accumulator IDs of a stage during the stage-completion event.
- Fix the construction of MLFunctions.
- Fix the implementation of `jobAndStageMetricsAggregation`, which
  inefficiently iterated over the task list multiple times.
- Remove a redundant data structure that maps accumulators to stages.
amahussein added a commit that referenced this issue Apr 30, 2024
* Refactor Stage info code between Q/P tools

Contributes to #477

This code change aims at bringing the Q/P tools handling of stages and
their accumulator to a common ground.
This code change includes a couple of fixes:
- Capture accumulator IDs of a stage during the stage-completion event.
- Fix the construction of MLFunctions.
- Fix the implementation of `jobAndStageMetricsAggregation`, which
  inefficiently iterated over the task list multiple times.
- Remove a redundant data structure that maps accumulators to stages.

* Remove unused class StageInfoClass
* Move StageModelManager to a separate scala class file

---------

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
@parthosa parthosa self-assigned this May 9, 2024