
Follow-up 1318: Fix QualX fallback with default speedup and duration columns #1330

Merged

Conversation

@parthosa parthosa (Collaborator) commented Sep 4, 2024

As a follow-up to #1318, this PR fixes the fallbacks from QualX by adding the estimated speedup and duration columns with default values.

Additionally, if any errors occur while creating these columns, the tool’s execution will be stopped, as it relies on speedup values to proceed.
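For illustration, the fallback amounts to roughly the following (a minimal sketch; the column names, variable names, and helper are hypothetical, not the actual implementation):

import pandas as pd

DEFAULT_SPEEDUP = 1.0

def add_default_speedup_columns(apps_df: pd.DataFrame) -> pd.DataFrame:
    # Backfill the speedup/duration columns with defaults when QualX produced no predictions.
    result = apps_df.copy()
    # A default speedup of 1.0 means "no estimated benefit" rather than "skipped".
    result['Estimated GPU Speedup'] = DEFAULT_SPEEDUP
    # With a 1.0 speedup, the estimated GPU duration equals the CPU duration.
    result['Estimated GPU Duration'] = result['App Duration']
    return result

qual_summary = pd.DataFrame({'App Name': ['app-1'], 'App Duration': [120000]})
try:
    qual_summary = add_default_speedup_columns(qual_summary)
except Exception as exc:
    # The report relies on speedup values, so stop instead of writing partial output.
    raise RuntimeError('Could not create default speedup columns') from exc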

Docs

No changes needed in documentation, as this is an internal change.

Test

Tested manually and added a test case in the E2E testing.

Signed-off-by: Partho Sarthi <[email protected]>
@parthosa parthosa added the bug (Something isn't working) and user_tools (Scope the wrapper module running CSP, QualX, and reports (python)) labels Sep 4, 2024
@parthosa parthosa self-assigned this Sep 4, 2024
@parthosa parthosa changed the title from "Follow-up 1318: Handle fallbacks from QualX" to "Follow-up 1318: Fix QualX fallback with default speedup and duration columns" Sep 4, 2024
@parthosa parthosa marked this pull request as ready for review September 4, 2024 17:18
@amahussein amahussein (Collaborator) left a comment

I am concerned about the frequent changes to error/exception handling. Those changes alter the behavior of the tools, and they are making reviewing and testing more difficult.

We need to be more conservative about making those changes on the Python side. Sometimes those behavior changes can be riskier than changing columns.

Comment on lines 592 to 593
except ValueError:  # pylint: disable=try-except-raise
    raise
Collaborator

What scenario could raise ScanTblError, and why do we need different handling for each error type?

Collaborator

The ScanTblError is used when expected P tool CSV files are missing (or fail to load). However, it looks like it may not be used any more, since abort_on_error is never set to True AFAICT.

Collaborator Author

Removed ScanTblError as it is not thrown anywhere.

@@ -582,15 +582,15 @@ def predict(
        qual_tool_filter=qual_tool_filter,
        qual_tool_output=qual_tool_output
    )
    if profile_df.empty:
        raise ValueError('Data preprocessing resulted in an empty dataset. Speedup predictions will be skipped.')
Collaborator

Speedup predictions will be skipped.: If we end up showing that the speedup is 1.0, then the word "skipped" will be confusing. If the tool says that something is "skipped", then it won't be in the results.
Does profile_df represent the raw_metrics information of all the apps? In the real world, we face situations where this information could be missing for some apps that don't have metrics/SQL. How does the code change affect the most common scenario of having a mix-match?

@leewyang leewyang (Collaborator) Sep 4, 2024

It looks like this is just restructuring the code to catch the appId key error on L588 for empty dataframes. In theory, we could just use the following instead (if we want to keep the original structure):

if not profile_df.empty:
    # reset appName to original
    profile_df['appName'] = profile_df['appId'].map(app_id_name_map)

However, it would be a bit redundant with another empty check outside of the try/except block.

As for the error message, we can just change it from "Speedup predictions will be skipped" to "Speedup predictions will default to 1.0".

Otherwise, LGTM

@parthosa parthosa (Collaborator Author) Sep 4, 2024

> Speedup predictions will be skipped.: If we end up showing that the speedup is 1.0, then the word "skipped" will be confusing.

Reworded the log statement to "Speedup predictions will default to 1.0".

> In the real world, we face situations where this information could be missing for some apps that don't have metrics/SQL. How does the code change affect the most common scenario of having a mix-match?

QualX Behavior

| Scenario | Metric Processing | Speedup Prediction |
| --- | --- | --- |
| (1) Valid Event Log | Processes profiler metrics | Predicts speedup |
| (2) Event Log without profiler metrics | Cannot read profiler metrics | Does not predict speedup |
| (3) Event Log with only unsupported execs | Does not process profiler metrics | Does not predict speedup |

| Combination | Metric Processing | Speedup Prediction |
| --- | --- | --- |
| (1) + (2) | Processes profiler metrics for apps in (1) | Predicts speedups for apps in (1); assigns a default speedup of 1.0 for apps in (2) |
| (1) + (3) | Processes profiler metrics for apps in (1) + (3) | Predicts speedups for apps in (1); calculates a speedup of 1.0 for apps in (3) (since fraction_supported = 0) |
| (2) + (3) | Cannot read profiler metrics for (2); does not process profiler metrics for (3) | Does not predict speedup |
| (1) + (2) + (3) | Processes profiler metrics for apps in (1) + (3) | Predicts speedups for apps in (1); calculates a speedup of 1.0 for apps in (3) (since fraction_supported = 0); assigns a default speedup of 1.0 for apps in (2) |

This PR adds the fix for only (2), only (3), and the combination (2) + (3), so that we have a default speedup of 1.0 in all cases.
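As a side note on why fraction_supported = 0 yields exactly 1.0: with an Amdahl-style combination (illustrative only, not necessarily QualX's exact formula), the unsupported fraction runs at 1.0x, so:

def overall_speedup(fraction_supported: float, predicted_speedup: float) -> float:
    # Only the supported fraction of the work is accelerated; the rest runs at 1.0x.
    return 1.0 / ((1.0 - fraction_supported) + fraction_supported / predicted_speedup)

overall_speedup(0.0, 3.5)  # -> 1.0, regardless of the predicted speedup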

@leewyang For the future, here are my thoughts:

  • Instead of early returns (on error or an empty df), we should modify QualX so that it returns a speedup for all apps.
  • If we have this guarantee, then there should be no cases of fallbacks being missed.

@leewyang leewyang (Collaborator) Sep 4, 2024

I think we were already backfilling any missing predictions w/ 1.0 defaults, except that in some (presumably rare) cases, we were returning empty dataframes to highlight major/unexpected issues. Theoretically, we could just return the default_preds_df for those cases now.
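Roughly, the backfill could look like this (a sketch with made-up column names; the actual default_preds_df construction in QualX may differ):

import pandas as pd

def backfill_predictions(all_app_ids: list, preds_df: pd.DataFrame) -> pd.DataFrame:
    # Start from the full list of apps so every app ends up with a row.
    base = pd.DataFrame({'appId': all_app_ids})
    merged = base.merge(preds_df, on='appId', how='left')
    # Apps without a model prediction fall back to the 1.0 default.
    merged['speedup'] = merged['speedup'].fillna(1.0)
    return merged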

Collaborator Author

@leewyang
There are many internal functions in QualX that raise ValueError. So we will have to keep a try-except in the qual tool to backfill missing predictions in any case.
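Something like the following guard in the qual tool wrapper (a sketch; run_qualx_predict and the column names are stand-ins, not the actual code):

import logging
import pandas as pd

logger = logging.getLogger(__name__)

def predict_with_fallback(run_qualx_predict, app_ids):
    # run_qualx_predict is a stand-in for the real QualX prediction entry point.
    try:
        preds_df = run_qualx_predict()
        if preds_df.empty:
            raise ValueError('QualX returned no predictions')
        return preds_df
    except ValueError as err:
        logger.warning('Speedup predictions will default to 1.0: %s', err)
        return pd.DataFrame({'appId': app_ids, 'speedup': 1.0})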

@amahussein

> How does the code change affect the most common scenario of having a mix-match?

In the case of mix-match, QualX currently backfills the null/NA values. This PR does not affect mix-match cases. It adds handling only for the case when no result was generated from QualX.

@parthosa parthosa (Collaborator Author) commented Sep 4, 2024

> I am concerned about the frequent changes to error/exception handling. Those changes alter the behavior of the tools, and they are making reviewing and testing more difficult.

The change in error handling is because the behavior of the tool changed. Previously, the tool included legacy "Speedup" and "Duration" columns, allowing it to continue even if QualX failed.

However, since these columns are no longer present, the tool cannot proceed if adding these columns from QualX (predicted or default) fails.

@amahussein amahussein merged commit 277e951 into NVIDIA:dev Sep 6, 2024
14 checks passed
@parthosa parthosa deleted the spark-rapids-tools-1199-handle-fallbacks branch October 9, 2024 17:11