Add heuristics using stage spill metrics to skip apps #1002

parthosa · 2024-05-08T22:59:34Z

Fixes #477. This PR adds a generic Additional Heuristics module that skips recommending apps based on heuristics. It introduces an additional column, Skip By Heuristics, in the qualification summary file. This logic is applied only when the user tools are run with --estimation_model xgboost, since it relies on profiler output.

Changes:

Added a specific heuristic to skip applications based on spill metrics:

Check if profiler output is present; otherwise, skip the heuristics logic.
For each application:
- Using job_+_stage_level_aggregated_task_metrics.csv, identify stages with spills greater than a threshold.
- Using sql_to_stage_information.csv, check whether those spill stages have Execs other than the allowed ones (Join, Aggregate or Sort).
- If there are stages with significant spills and the spills come from Execs other than the allowed ones
  => Column Skip By Heuristics would be True for this application.
Finally, while calculating SpeedUp Category, if Skip By Heuristics is True for the application, set the category to Not Recommended.
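
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in additional_heuristics.py: the row keys (`stage_id`, `memory_bytes_spilled`), the function names, and the substring matching against the allowed Execs are all assumptions made for the example, and the 10 GB threshold is interpreted here as 10 * 1024**3 bytes.

```python
from collections import defaultdict

# Assumed values for illustration; the real threshold lives in qualification-conf.yaml.
SPILL_THRESHOLD_BYTES = 10 * 1024**3
ALLOWED_EXECS = ("Join", "Aggregate", "Sort")


def stages_with_spill(stage_metrics, threshold=SPILL_THRESHOLD_BYTES):
    """Return the sorted IDs of stages whose total memory spill exceeds the threshold."""
    totals = defaultdict(int)
    for row in stage_metrics:
        totals[row["stage_id"]] += int(row["memory_bytes_spilled"])
    return sorted(stage for stage, total in totals.items() if total > threshold)


def has_disallowed_exec(exec_names):
    """True if at least one exec name matches none of the allowed keywords."""
    return any(
        not any(allowed in name for allowed in ALLOWED_EXECS)
        for name in exec_names
    )


def should_skip(stage_metrics, stage_execs, threshold=SPILL_THRESHOLD_BYTES):
    """Apply the spill heuristic to one app and return (skip, offending_stages).

    A stage with no known execs is not treated as offending in this sketch.
    """
    offending = [
        stage
        for stage in stages_with_spill(stage_metrics, threshold)
        if has_disallowed_exec(stage_execs.get(stage, []))
    ]
    return bool(offending), offending


def speedup_category(estimated_category, skip_by_heuristics):
    """Override the estimated category when the heuristic flags the app."""
    return "Not Recommended" if skip_by_heuristics else estimated_category
```

In this sketch `stage_metrics` stands in for rows read from job_+_stage_level_aggregated_task_metrics.csv and `stage_execs` for a stage-to-Execs mapping derived from sql_to_stage_information.csv.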

Things to discuss:

Spill Threshold is currently set to 10 GB (configurable in qualification-conf.yaml).
We will read 3 CSVs per app. This should not be a bottleneck, as Core tools would still take the majority of the time.
- QualX reads ~10 csvs per app.
Corner case:
- An application is re-run and we test both eventlogs together.
- For TCO, we group them by name; in this case the column Skip By Heuristics is aggregated using the any() function.
- So, if any one of the runs was skipped, the grouped application will also be skipped.
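
A small sketch of that corner case, assuming each run is represented as a dict with hypothetical `App Name` and `Skip By Heuristics` keys (the real grouping happens in the TCO aggregation and may look different):

```python
from collections import defaultdict


def aggregate_skip_by_name(runs):
    """Group runs by app name; a group is skipped if any of its runs was skipped."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[run["App Name"]].append(run["Skip By Heuristics"])
    return {name: any(flags) for name, flags in grouped.items()}


runs = [
    {"App Name": "nightly-etl", "Skip By Heuristics": False},
    {"App Name": "nightly-etl", "Skip By Heuristics": True},
    {"App Name": "reporting", "Skip By Heuristics": False},
]
result = aggregate_skip_by_name(runs)
```

Here the two runs of the made-up app `nightly-etl` collapse into a single skipped entry, while `reporting` stays unskipped.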

Steps to Evaluate:

Manually set "Memory Bytes Spilled":50000000 in the SparkListenerTaskEnd events for certain stages in any test event log.
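
For convenience, the manual edit could be scripted along these lines. This is only a sketch: `inject_spill` is a hypothetical helper, and it assumes an uncompressed event log with one JSON object per line, where task-end events carry `Event`, `Stage ID`, and a `Task Metrics` object.

```python
import json


def inject_spill(lines, stage_ids, spill_bytes=50000000):
    """Yield event-log lines with the memory spill overridden for the given stages."""
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # drop blank lines
        event = json.loads(stripped)
        if (
            event.get("Event") == "SparkListenerTaskEnd"
            and event.get("Stage ID") in stage_ids
            and "Task Metrics" in event
        ):
            event["Task Metrics"]["Memory Bytes Spilled"] = spill_bytes
        yield json.dumps(event)
```

Typical use would be to read the original log, pass its lines through `inject_spill` with the chosen stage IDs, and write the result to a new file that is then fed to the tools.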

Signed-off-by: Partho Sarthi <[email protected]>

user_tools/src/spark_rapids_tools/tools/additional_heuristics.py

tgravescs · 2024-05-09T13:18:42Z

does this have any output saying which execs/stages/sqlid cause this to be skipped?
Does this have output to explain why it should be skipped?

Did we generate any test eventlogs/queries that can be used to continue integration testing this or that could also be used by qualx to train on this scenario?

It seems like this should be done in Java, is there a followup to move it there?

parthosa · 2024-05-09T17:44:56Z

> does this have any output saying which execs/stages/sqlid cause this to be skipped?
> Does this have output to explain why it should be skipped?

There is no output associated with it. A column, Skip By Heuristics Reason, can be added to carry the details, for example:
StageId <stage_id> had <spill_size> spill

> Did we generate any test eventlogs/queries that can be used to continue integration testing this or that could also be used by qualx to train on this scenario?

I have test event logs that I used to test this scenario. These can be added to the integration tests. I will include this as part of improving E2E tools testing (#970).

> It seems like this should be done in Java, is there a followup to move it there?

Yes. Currently we need metrics from the Profiling tool for this estimate. Once the Profiling and Qualification tools are merged, we will migrate this to the Java/Scala side. Created issue #1008 to track this.

parthosa · 2024-05-14T23:06:45Z

Changes

Create a new directory, intermediate_output, to store all intermediate output generated by user tools. We should avoid putting too much information in qualification_summary.csv.
Create a new file, heuristics_info.csv, in the above directory to store [App ID,Skip by Heuristics,Reason].

Reasons

There could be two potential reasons:

Spilling occurred - We should skip the app based on heuristics
Profiler did not generate relevant output for the app - We should not skip the app based on heuristics (other reasons may still be applied)

Output Covering both cases:

File: intermediate_output/heuristics_info.csv

|--------------------------------|--------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
| App ID                         | Skip by Heuristics | Reason                                                                                                                                            |
|--------------------------------|--------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
| application_1686676198636_0003 | True               | Skipping due to spills in stages [39; 41; 40] exceeding 1000000000 bytes                                                                          |
|--------------------------------|--------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
| application_1686676198636_0002 | False              |                                                                                                                                                   |
|--------------------------------|--------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
| app-20231212214826-0000        | False              | Cannot apply heuristics for qualification. Reason - FileNotFoundError:[Errno 2] No such file or directory: '/<path>/sql_to_stage_information.csv' |
|--------------------------------|--------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
| app-20240312004226-0000        | True               | Skipping due to spills in stages [60; 58] exceeding 1000000000 bytes                                                                              |
|--------------------------------|--------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
| app-20240312023625-0000        | False              |                                                                                                                                                   |
|--------------------------------|--------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
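
One way rows like the ones above could be produced is sketched below. The helper names and the `check` callback are hypothetical; the sketch only mirrors the two reasons described in this comment (spill detected, or profiler output missing) and joins stage IDs with a semicolon so the Reason value contains no commas.

```python
import csv
import io


def build_heuristics_row(app_id, check, threshold_bytes):
    """Return [App ID, Skip by Heuristics, Reason] for a single app.

    `check` is a hypothetical callable returning (skip, stage_ids); it may raise
    FileNotFoundError when the profiler did not generate the expected CSV files.
    """
    try:
        skip, stages = check(app_id)
    except FileNotFoundError as err:
        # Missing profiler output: do not skip the app, but record why.
        reason = (
            "Cannot apply heuristics for qualification. "
            f"Reason - {type(err).__name__}:{err}"
        )
        return [app_id, False, reason]
    reason = ""
    if skip:
        stage_list = "; ".join(str(stage) for stage in stages)
        reason = (
            f"Skipping due to spills in stages [{stage_list}] "
            f"exceeding {threshold_bytes} bytes"
        )
    return [app_id, skip, reason]


def write_heuristics_info(rows, out):
    """Write the header and rows in CSV form to a file-like object."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["App ID", "Skip by Heuristics", "Reason"])
    writer.writerows(rows)


def fake_check(app_id):
    """Stand-in for the real per-app heuristic, used only for this example."""
    if app_id == "app-missing":
        raise FileNotFoundError(2, "No such file or directory", "/tmp/x.csv")
    if app_id == "app-spill":
        return True, [60, 58]
    return False, []


rows = [
    build_heuristics_row(app, fake_check, 1000000000)
    for app in ("app-spill", "app-clean", "app-missing")
]
buffer = io.StringIO()
write_heuristics_info(rows, buffer)
```

Catching the error per app keeps one missing file from aborting the whole run, which matches the "should not skip the app" behaviour described under Reasons.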

amahussein

@mattahrens Are you OK with the PR as a temporary workaround until the heuristics are implemented in the Scala module?

mattahrens · 2024-05-15T16:42:35Z

Yes, I'm fine with it. 👍

amahussein

LGTM.
Thanks @parthosa

user_tools/src/spark_rapids_tools/tools/speedup_category.py

cindyyuanjiang

Thanks @parthosa! Just a minor nit.
I am also wondering in the output file: is 1000000000 bytes or 10 GB more clear?

parthosa · 2024-05-16T18:07:52Z

> I am also wondering in the output file: is 1000000000 bytes or 10 GB more clear?

Thanks @cindyyuanjiang. I think 10 GB would be clearer. Added a function to convert bytes to a human-readable format. The reason now reads:

App ID,Skip by Heuristics,Reason
app-20240312004226-0000,True,Skipping due to spills in stages [60; 58] exceeding 10.00 GB
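
A conversion of this kind might look like the sketch below. The function name and the choice of 1024-based units with two decimal places are assumptions for illustration; the helper added in this PR may differ.

```python
def to_human_readable(num_bytes):
    """Format a byte count with two decimals, using 1024-based unit steps."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    size = float(num_bytes)
    for unit in units[:-1]:
        if abs(size) < 1024.0:
            return f"{size:.2f} {unit}"
        size /= 1024.0
    # Anything larger than the second-to-last unit is reported in the last unit.
    return f"{size:.2f} {units[-1]}"


threshold_text = to_human_readable(10 * 1024**3)
```

With these assumptions, a threshold of 10 * 1024**3 bytes is rendered as `10.00 GB`.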

cindyyuanjiang

Thanks @parthosa! LGTM.

Add heuristics using stage spill metrics to skip apps

d8b3a64

Signed-off-by: Partho Sarthi <[email protected]>

parthosa added the "feature request" (New feature or request) and "user_tools" (Scope the wrapper module running CSP, QualX, and reports (python)) labels May 8, 2024

parthosa requested review from mattahrens, cindyyuanjiang, amahussein and nartal1 May 8, 2024 22:59

parthosa self-assigned this May 8, 2024

parthosa mentioned this pull request May 8, 2024

[FEA] Qualification tool should look at spill metrics #477

Closed

tgravescs reviewed May 9, 2024

user_tools/src/spark_rapids_tools/tools/additional_heuristics.py (outdated, resolved)

parthosa added 2 commits May 9, 2024 16:31

Remove disk spill metrics

ca21dd2

Signed-off-by: Partho Sarthi <[email protected]>

Add skip reason

b774958

Signed-off-by: Partho Sarthi <[email protected]>

parthosa requested a review from tgravescs May 14, 2024 05:49

parthosa marked this pull request as draft May 14, 2024 18:58

parthosa added 4 commits May 14, 2024 14:25

Revert "Add skip reason"

f3987c5

This reverts commit b774958.

Generate skip reason to intermediate output directory

75ac4e8

Signed-off-by: Partho Sarthi <[email protected]>

Merge branch 'dev' into spark-rapids-tools-477-skip-using-spill-metrics

608770a

Change delimiter to semi colon and update reason column name

16435a3

Signed-off-by: Partho Sarthi <[email protected]>

parthosa marked this pull request as ready for review May 14, 2024 23:15

parthosa mentioned this pull request May 14, 2024

Store Cluster Shape Recommendation in User Tools Qualification Output #1005

Merged

amahussein reviewed May 15, 2024

amahussein previously approved these changes May 15, 2024

cindyyuanjiang reviewed May 16, 2024

user_tools/src/spark_rapids_tools/tools/speedup_category.py (resolved)

cindyyuanjiang previously approved these changes May 16, 2024

Add function to convert size to human-readable format

a3a7551

Signed-off-by: Partho Sarthi <[email protected]>

parthosa dismissed cindyyuanjiang’s stale review via a3a7551 May 16, 2024 18:06

parthosa dismissed amahussein’s stale review via a3a7551 May 16, 2024 18:06

parthosa requested a review from cindyyuanjiang May 16, 2024 18:28

cindyyuanjiang approved these changes May 16, 2024

parthosa merged commit 4f592ce into NVIDIA:dev May 16, 2024
15 checks passed

parthosa deleted the spark-rapids-tools-477-skip-using-spill-metrics branch May 16, 2024 23:04
