Add heuristics using stage spill metrics to skip apps #1002

Merged

@parthosa parthosa commented May 8, 2024

Fixes #477. This PR adds a generic Additional Heuristics module that skips recommending apps based on heuristics. It introduces an additional column, Skip By Heuristics, in the qualification summary file. This logic is applied only when the user tools are run with --estimation_model xgboost, since it relies on profiler output.

Changes:

Added a specific heuristic to skip applications based on spill metrics:

  1. Check if profiler output is present, else skip the heuristics logic.
  2. For each application:
    • Using job_+_stage_level_aggregated_task_metrics.csv, identify stages with spills greater than a threshold.
    • Using sql_to_stage_information.csv, check whether those spill stages contain Execs other than the allowed ones (Join, Aggregate or Sort).
    • If there are stages with significant spills and the spills come from Execs other than the allowed ones
      => Column Skip By Heuristics would be True for this application.
  3. Finally, while calculating SpeedUp Category, if Skip By Heuristics is True for the application, set the category to Not Recommended.
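
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the record keys (`stage_id`, `spill_bytes`), the helper names, the substring matching of exec names, and the reading of 10 GB as 10 * 1024**3 bytes are all hypothetical.

```python
# Exec families named in the PR description; matching by substring is an assumption.
ALLOWED_EXECS = ("Join", "Aggregate", "Sort")
# Assumed interpretation of the 10 GB threshold.
SPILL_THRESHOLD_BYTES = 10 * 1024**3


def stages_over_threshold(stage_metrics, threshold=SPILL_THRESHOLD_BYTES):
    """Return the IDs of stages whose spill exceeds the threshold."""
    return [row["stage_id"] for row in stage_metrics if row["spill_bytes"] > threshold]


def should_skip(stage_metrics, stage_execs, threshold=SPILL_THRESHOLD_BYTES):
    """Return (skip, offending_stage_ids) for one application.

    stage_execs maps a stage ID to the list of exec names that ran in it.
    """
    offending = []
    for stage_id in stages_over_threshold(stage_metrics, threshold):
        execs = stage_execs.get(stage_id, [])
        # A stage counts only if at least one exec is outside the allowed families.
        if any(not any(allowed in name for allowed in ALLOWED_EXECS) for name in execs):
            offending.append(stage_id)
    return bool(offending), offending


# Hypothetical data: stage 39 spills heavily and contains a Project exec.
metrics = [
    {"stage_id": 39, "spill_bytes": 12 * 1024**3},
    {"stage_id": 40, "spill_bytes": 11 * 1024**3},
    {"stage_id": 41, "spill_bytes": 1024},
]
execs = {39: ["SortMergeJoin", "Project"], 40: ["HashAggregate"], 41: ["Filter"]}
skip, stages = should_skip(metrics, execs)
```

With this sample data, only stage 39 is flagged: stage 40 spills but contains only an allowed exec, and stage 41 is below the threshold.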

Things to discuss:

  1. Spill Threshold is currently set to 10 GB (configurable in qualification-conf.yaml).
  2. We will read 3 CSV files per app. This should not be a bottleneck, as the Core tools would still take the majority of the time.
    • For comparison, QualX reads ~10 CSV files per app.
  3. Corner case:
    • An application is re-run and we test both event logs together.
    • For TCO, we group them by name; in this case the column Skip By Heuristics will be aggregated using the any() function.
    • So, if any one of the runs was skipped, the grouped application will also be skipped.
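
The any() aggregation in the corner case can be illustrated with a small sketch. The record keys (`app_name`, `skip_by_heuristics`) and the function name are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def aggregate_skip_by_name(runs):
    """Group runs by application name and combine their skip flags with any()."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[run["app_name"]].append(run["skip_by_heuristics"])
    # One skipped run is enough to skip the grouped application.
    return {name: any(flags) for name, flags in grouped.items()}


# Hypothetical runs: "etl_job" was run twice and only one run was skipped.
runs = [
    {"app_name": "etl_job", "skip_by_heuristics": False},
    {"app_name": "etl_job", "skip_by_heuristics": True},
    {"app_name": "report_job", "skip_by_heuristics": False},
]
grouped_skip = aggregate_skip_by_name(runs)
```

Here `etl_job` ends up skipped because one of its two runs was skipped, while `report_job` does not.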

Steps to Evaluate:

  • Manually set "Memory Bytes Spilled":50000000 in the SparkListenerTaskEnd event for certain stages in any test event log.
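
Instead of editing the event log by hand, the change could be scripted. The sketch below assumes an uncompressed, newline-delimited JSON event log in which each SparkListenerTaskEnd event carries a top-level "Stage ID" field and a "Task Metrics" object; the function name `patch_spill` is hypothetical.

```python
import json


def patch_spill(lines, stage_ids, spill_bytes=50000000):
    """Yield event-log lines with the spill metric overridden for the given stages."""
    for line in lines:
        stripped = line.strip()
        if not stripped:
            yield line
            continue
        event = json.loads(stripped)
        if (
            event.get("Event") == "SparkListenerTaskEnd"
            and event.get("Stage ID") in stage_ids
            and "Task Metrics" in event
        ):
            # Override only the spill metric; everything else is left untouched.
            event["Task Metrics"]["Memory Bytes Spilled"] = spill_bytes
            yield json.dumps(event) + "\n"
        else:
            yield line


# Hypothetical two-line log: only the task in stage 3 should be patched.
sample = [
    '{"Event": "SparkListenerTaskEnd", "Stage ID": 3, "Task Metrics": {"Memory Bytes Spilled": 0}}\n',
    '{"Event": "SparkListenerTaskEnd", "Stage ID": 4, "Task Metrics": {"Memory Bytes Spilled": 0}}\n',
]
patched = list(patch_spill(sample, stage_ids={3}))
```

The same generator can be applied to the lines of a real log file and the result written to a new file, leaving the original log intact.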

@parthosa parthosa added feature request New feature or request user_tools Scope the wrapper module running CSP, QualX, and reports (python) labels May 8, 2024
@parthosa parthosa self-assigned this May 8, 2024
@tgravescs
Collaborator

does this have any output saying which execs/stages/sqlid cause this to be skipped?
Does this have output to explain why it should be skipped?

Did we generate any test eventlogs/queries that can be used to continue integration testing this or that could also be used by qualx to train on this scenario?

It seems like this should be done in Java, is there a followup to move it there?

@parthosa
Collaborator Author

parthosa commented May 9, 2024

does this have any output saying which execs/stages/sqlid cause this to be skipped?
Does this have output to explain why it should be skipped?

There is currently no output associated with it. A column Skip By Heuristics Reason can be added that gives the details, for example:
StageId <stage_id> had <spill_size> spill

Did we generate any test eventlogs/queries that can be used to continue integration testing this or that could also be used by qualx to train on this scenario?

I have test event logs that I used to test this scenario. This can be added in the integration testing. I will include this as part of improving E2E tools testing #970

It seems like this should be done in Java, is there a followup to move it there?

Yes, currently we need metrics from the Profiling tool for this estimate. Once the Profiling and Qualification tools are merged, we will migrate this to the Java/Scala side. Created an issue to track this: #1008

Signed-off-by: Partho Sarthi <[email protected]>
Signed-off-by: Partho Sarthi <[email protected]>
@parthosa parthosa requested a review from tgravescs May 14, 2024 05:49
@parthosa parthosa marked this pull request as draft May 14, 2024 18:58
@parthosa
Collaborator Author

Changes

  1. Create a new directory intermediate_output to store all intermediate output generated by user tools. We should avoid putting too much information in the qualification_summary.csv
  2. Create a new file heuristics_info.csv in the above directory to store [App ID,Skip by Heuristics,Reason]

Reasons

There could be two potential reasons:

  1. Spilling occurred - We should skip the app based on heuristics
  2. Profiler did not generate relevant output for the app - We should not skip the app based on heuristics (other reasons may still be applied)

Output Covering both cases:

File: intermediate_output/heuristics_info.csv

| App ID | Skip by Heuristics | Reason |
|---|---|---|
| application_1686676198636_0003 | True | Skipping due to spills in stages [39; 41; 40] exceeding 1000000000 bytes |
| application_1686676198636_0002 | False | |
| app-20231212214826-0000 | False | Cannot apply heuristics for qualification. Reason - FileNotFoundError:[Errno 2] No such file or directory: '/<path>/sql_to_stage_information.csv' |
| app-20240312004226-0000 | True | Skipping due to spills in stages [60; 58] exceeding 1000000000 bytes |
| app-20240312023625-0000 | False | |
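
The two cases in the table above (heuristic applied, or profiler output missing) could be produced along these lines. This is a sketch only: `heuristics_row`, `write_heuristics_info`, and the `check` callable are hypothetical names, and catching a broad `Exception` is a simplification.

```python
import csv
import io

FIELDS = ["App ID", "Skip by Heuristics", "Reason"]


def heuristics_row(app_id, check):
    """Run `check` (a callable returning (skip, reason)) and build one CSV row."""
    try:
        skip, reason = check()
    except Exception as exc:  # e.g. FileNotFoundError when profiler output is missing
        # A failure to apply the heuristic must not skip the app.
        skip = False
        reason = (
            "Cannot apply heuristics for qualification. "
            f"Reason - {type(exc).__name__}:{exc}"
        )
    return {"App ID": app_id, "Skip by Heuristics": skip, "Reason": reason}


def write_heuristics_info(rows, out):
    """Write the rows as CSV to any text file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def missing_output():
    # Simulates profiler output that was never generated (hypothetical path).
    raise FileNotFoundError(2, "No such file or directory", "/tmp/sql_to_stage_information.csv")


rows = [
    heuristics_row("app-1", lambda: (True, "Skipping due to spills in stages [60; 58]")),
    heuristics_row("app-2", missing_output),
]
buffer = io.StringIO()
write_heuristics_info(rows, buffer)
```

In this sketch a missing file yields `Skip by Heuristics = False` with the error recorded in `Reason`, matching the second case described above.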


Collaborator

@amahussein amahussein left a comment

@mattahrens Are you OK with the PR as a temporary workaround until the heuristics are implemented in the Scala module?

@mattahrens
Collaborator

Yes, I'm fine with it. 👍

amahussein
amahussein previously approved these changes May 15, 2024
Collaborator

@amahussein amahussein left a comment

LGTM.
Thanks @parthosa

cindyyuanjiang
cindyyuanjiang previously approved these changes May 16, 2024
Collaborator

@cindyyuanjiang cindyyuanjiang left a comment

Thanks @parthosa! Just a minor nit.
I am also wondering in the output file: is 1000000000 bytes or 10 GB more clear?

@parthosa
Collaborator Author

I am also wondering in the output file: is 1000000000 bytes or 10 GB more clear?

Thanks @cindyyuanjiang. I think 10 GB would be clearer. Added a function to convert bytes to a human-readable format. We have the following reason now:

App ID,Skip by Heuristics,Reason
app-20240312004226-0000,True,Skipping due to spills in stages [60; 58] exceeding 10.00 GB
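
A byte-to-human-readable conversion of the kind mentioned above might look like the following. This is a sketch, not the function added in the PR: the name `bytes_to_human_readable`, the base-1024 units, and the two-decimal default are assumptions.

```python
def bytes_to_human_readable(num_bytes, precision=2):
    """Format a byte count using base-1024 units (an assumed convention)."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the value below 1024, or at the largest unit.
        if size < 1024 or unit == units[-1]:
            return f"{size:.{precision}f} {unit}"
        size /= 1024


# Hypothetical reason string built with the helper.
threshold = 10 * 1024**3
reason = f"Skipping due to spills in stages [60; 58] exceeding {bytes_to_human_readable(threshold)}"
```

Keeping the raw byte threshold in the configuration and formatting only at reporting time means the comparison logic does not depend on the display units.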

Collaborator

@cindyyuanjiang cindyyuanjiang left a comment

Thanks @parthosa! LGTM.

@parthosa parthosa merged commit 4f592ce into NVIDIA:dev May 16, 2024
15 checks passed
@parthosa parthosa deleted the spark-rapids-tools-477-skip-using-spill-metrics branch May 16, 2024 23:04
Labels
feature request New feature or request user_tools Scope the wrapper module running CSP, QualX, and reports (python)
Development

Successfully merging this pull request may close these issues.

[FEA] Qualification tool should look at spill metrics
5 participants