
Follow-up 1318: Fix QualX fallback with default speedup and duration columns #1330

Merged

Conversation

@parthosa parthosa (Collaborator) commented Sep 4, 2024

As a follow-up to #1318, this PR fixes the fallbacks from QualX by adding the estimated speedup and duration columns with default values.

Additionally, if any errors occur while creating these columns, the tool’s execution will be stopped, as it relies on speedup values to proceed.
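For illustration, the fallback amounts to roughly the following (a minimal sketch; the column names, variable names, and helper are hypothetical, not the actual implementation):

import pandas as pd

DEFAULT_SPEEDUP = 1.0

def add_default_speedup_columns(apps_df: pd.DataFrame) -> pd.DataFrame:
    # Backfill the speedup/duration columns with defaults when QualX produced no predictions.
    result = apps_df.copy()
    # A default speedup of 1.0 means "no estimated benefit" rather than "skipped".
    result['Estimated GPU Speedup'] = DEFAULT_SPEEDUP
    # With a 1.0 speedup, the estimated GPU duration equals the CPU duration.
    result['Estimated GPU Duration'] = result['App Duration']
    return result

qual_summary = pd.DataFrame({'App Name': ['app-1'], 'App Duration': [120000]})
try:
    qual_summary = add_default_speedup_columns(qual_summary)
except Exception as exc:
    # The report relies on speedup values, so stop instead of writing partial output.
    raise RuntimeError('Could not create default speedup columns') from exc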

Docs

No changes needed in documentation, as this is an internal change.

Test

Tested manually and added a test case in the E2E testing.

Signed-off-by: Partho Sarthi <[email protected]>
@parthosa parthosa added the bug (Something isn't working) and user_tools (Scope the wrapper module running CSP, QualX, and reports (python)) labels Sep 4, 2024
@parthosa parthosa self-assigned this Sep 4, 2024
@parthosa parthosa changed the title from "Follow-up 1318: Handle fallbacks from QualX" to "Follow-up 1318: Fix QualX fallback with default speedup and duration columns" Sep 4, 2024
@parthosa parthosa marked this pull request as ready for review September 4, 2024 17:18
@amahussein amahussein (Collaborator) left a comment

I am concerned about the frequent changes to error/exception handling. Those changes alter the behavior of the tools, and they are making reviewing and testing more difficult.

We need to be more conservative about making those changes on the Python side. Sometimes those behavior changes can be riskier than changing columns.

Comment on lines 592 to 593
except ValueError:  # pylint: disable=try-except-raise
    raise
Collaborator

What scenario could raise ScanTblError, and why do we need different handling for each error type?

Collaborator

The ScanTblError is used when expected P tool CSV files are missing (or fail to load). However, it looks like it may not be used any more, since abort_on_error is never set to True AFAICT.

Collaborator Author

Removed ScanTblError as it is not thrown anywhere.

@@ -582,15 +582,15 @@ def predict(
        qual_tool_filter=qual_tool_filter,
        qual_tool_output=qual_tool_output
    )
    if profile_df.empty:
        raise ValueError('Data preprocessing resulted in an empty dataset. Speedup predictions will be skipped.')
Collaborator

Speedup predictions will be skipped.: If we end up showing that the speedup is 1.0, then the word "skipped" will be confusing. If the tool says that something is "skipped", then it won't be in the results.
Does profile_df represent the raw_metrics information of all the apps? In the real world, we face situations where this information could be missing for some apps that don't have metrics/SQL. How does the code change affect the most common scenario of having a mix-match?

@leewyang leewyang (Collaborator) Sep 4, 2024

It looks like this is just restructuring the code to catch the appId key error on L588 for empty dataframes. In theory, we could just use the following instead (if we want to keep the original structure):

if not profile_df.empty:
    # reset appName to original
    profile_df['appName'] = profile_df['appId'].map(app_id_name_map)

However, it would be a bit redundant with another empty check outside of the try/except block.

As for the error message, we can just change it from "Speedup predictions will be skipped" to "Speedup predictions will default to 1.0".

Otherwise, LGTM

@parthosa parthosa (Collaborator Author) Sep 4, 2024

> Speedup predictions will be skipped.: If we end up showing that the speedup is 1.0, then the word "skipped" will be confusing.

Reworded the log statement to "Speedup predictions will default to 1.0".

> In the real world, we face situations where this information could be missing for some apps that don't have metrics/SQL. How does the code change affect the most common scenario of having a mix-match?

QualX Behavior

| Scenario | Metric Processing | Speedup Prediction |
| --- | --- | --- |
| (1) Valid Event Log | Processes profiler metrics | Predicts speedup |
| (2) Event Log without profiler metrics | Cannot read profiler metrics | Does not predict speedup |
| (3) Event Log with only unsupported execs | Does not process profiler metrics | Does not predict speedup |

| Combination | Metric Processing | Speedup Prediction |
| --- | --- | --- |
| (1) + (2) | Processes profiler metrics for apps in (1) | Predicts speedups for apps in (1); assigns a default speedup of 1.0 for apps in (2) |
| (1) + (3) | Processes profiler metrics for apps in (1) + (3) | Predicts speedups for apps in (1); calculates a speedup of 1.0 for apps in (3) (since fraction_supported = 0) |
| (2) + (3) | Cannot read profiler metrics for (2); does not process profiler metrics for (3) | Does not predict speedup |
| (1) + (2) + (3) | Processes profiler metrics for apps in (1) + (3) | Predicts speedups for apps in (1); calculates a speedup of 1.0 for apps in (3) (since fraction_supported = 0); assigns a default speedup of 1.0 for apps in (2) |

This PR adds the fix for only (2), only (3), and the combination (2) + (3), so that we have a default speedup of 1.0 in all cases.
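As a side note on why fraction_supported = 0 yields exactly 1.0: with an Amdahl-style combination (illustrative only, not necessarily QualX's exact formula), the unsupported fraction runs at 1.0x, so:

def overall_speedup(fraction_supported: float, predicted_speedup: float) -> float:
    # Only the supported fraction of the work is accelerated; the rest runs at 1.0x.
    return 1.0 / ((1.0 - fraction_supported) + fraction_supported / predicted_speedup)

overall_speedup(0.0, 3.5)  # -> 1.0, regardless of the predicted speedup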

@leewyang For the future, here are my thoughts:

  • Instead of early returns (on error or an empty df), we should modify QualX so that it returns a speedup for all apps.
  • If we have this guarantee, then there should be no cases of fallbacks being missed.

@leewyang leewyang (Collaborator) Sep 4, 2024

I think we were already backfilling any missing predictions w/ 1.0 defaults, except that in some (presumably rare) cases, we were returning empty dataframes to highlight major/unexpected issues. Theoretically, we could just return the default_preds_df for those cases now.
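Roughly, the backfill could look like this (a sketch with made-up column names; the actual default_preds_df construction in QualX may differ):

import pandas as pd

def backfill_predictions(all_app_ids: list, preds_df: pd.DataFrame) -> pd.DataFrame:
    # Start from the full list of apps so every app ends up with a row.
    base = pd.DataFrame({'appId': all_app_ids})
    merged = base.merge(preds_df, on='appId', how='left')
    # Apps without a model prediction fall back to the 1.0 default.
    merged['speedup'] = merged['speedup'].fillna(1.0)
    return merged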

Collaborator Author

@leewyang
There are many internal functions in QualX that raise ValueError. So we will have to keep a try-except in the qual tool to backfill missing predictions in any case.
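Something like the following guard in the qual tool wrapper (a sketch; run_qualx_predict and the column names are stand-ins, not the actual code):

import logging
import pandas as pd

logger = logging.getLogger(__name__)

def predict_with_fallback(run_qualx_predict, app_ids):
    # run_qualx_predict is a stand-in for the real QualX prediction entry point.
    try:
        preds_df = run_qualx_predict()
        if preds_df.empty:
            raise ValueError('QualX returned no predictions')
        return preds_df
    except ValueError as err:
        logger.warning('Speedup predictions will default to 1.0: %s', err)
        return pd.DataFrame({'appId': app_ids, 'speedup': 1.0})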

@amahussein

> How does the code change affect the most common scenario of having a mix-match?

In the case of mix-match, QualX currently backfills the null/NA values. This PR does not affect mix-match cases. It adds handling only for the case when no result was generated from QualX.

@parthosa parthosa (Collaborator Author) commented Sep 4, 2024

> I am concerned about the frequent changes to error/exception handling. Those changes alter the behavior of the tools, and they are making reviewing and testing more difficult.

The change in error handling is because the behavior of the tool changed. Previously, the tool included legacy "Speedup" and "Duration" columns, allowing it to continue even if QualX failed.

However, since these columns are no longer present, the tool cannot proceed if adding these columns from QualX (predicted or default) fails.

@amahussein amahussein merged commit 277e951 into NVIDIA:dev Sep 6, 2024
14 checks passed
@parthosa parthosa deleted the spark-rapids-tools-1199-handle-fallbacks branch October 9, 2024 17:11