Follow-up 1318: Fix QualX fallback with default speedup and duration columns #1330
Conversation
Signed-off-by: Partho Sarthi <[email protected]>
I am concerned with the frequent changes to error/exception handling. Those changes alter the behavior of the tools, which makes reviewing and testing more difficult.
We need to be more conservative about making those changes on the Python side. Sometimes those behavior changes can be riskier than changing columns.
    except ValueError:  # pylint: disable=try-except-raise
        raise
What's the different scenario that could raise ScanTblError? And why would we have different handling for each error type?
The ScanTblError is used when expected P tool CSV files are missing (or fail to load). However, it looks like it may not be used any more, since abort_on_error is never set to True AFAICT.
Removed ScanTblError as it is not thrown anywhere.
@@ -582,15 +582,15 @@ def predict(
            qual_tool_filter=qual_tool_filter,
            qual_tool_output=qual_tool_output
        )
        if profile_df.empty:
            raise ValueError('Data preprocessing resulted in an empty dataset. Speedup predictions will be skipped.')
"Speedup predictions will be skipped.": If we end up showing that the speedup is 1.0, then the word "skipped" will be confusing. If the tool says that something is "skipped", then it won't be in the results.
Does profile_df represent the raw_metrics information of all the apps? In the real world, we face the situation that it could be missing for some apps that don't have metrics/SQL. How does the code change affect the most common scenario of having a mix-match?
It looks like this is just restructuring the code to catch the appId key error on L588 for empty dataframes. In theory, we could just use the following instead (if we want to keep the original structure):
if not profile_df.empty:
# reset appName to original
profile_df['appName'] = profile_df['appId'].map(app_id_name_map)
However, that would be a bit redundant with another empty check outside of the try/except block.
As for the error message, we can just change it from "Speedup predictions will be skipped" to "Speedup predictions will default to 1.0".
Otherwise, LGTM
> "Speedup predictions will be skipped.": If we end up showing that the speedup is 1.0, then the word "skipped" will be confusing.

Reworded the log statement to "Speedup predictions will default to 1.0."
> In the real world, we face the situation that it could be missing for some apps that don't have metrics/SQL. How does the code change affect the most common scenario of having a mix-match?

QualX Behavior

| Scenario | Metric Processing | Speedup Prediction |
|---|---|---|
| (1) Valid event log | Processes profiler metrics | Predicts speedup |
| (2) Event log without profiler metrics | Cannot read profiler metrics | Does not predict speedup |
| (3) Event log with only unsupported execs | Does not process profiler metrics | Does not predict speedup |

| Combination | Metric Processing | Speedup Prediction |
|---|---|---|
| (1) + (2) | Processes profiler metrics for apps in (1) | Predicts speedups for apps in (1); assigns a default speedup of 1.0 for apps in (2) |
| (1) + (3) | Processes profiler metrics for apps in (1) + (3) | Predicts speedups for apps in (1); calculates a speedup of 1.0 for apps in (3) (since `fraction_supported` = 0) |
| (2) + (3) | Cannot read profiler metrics for (2); does not process profiler metrics for (3) | Does not predict speedup |
| (1) + (2) + (3) | Processes profiler metrics for apps in (1) + (3) | Predicts speedups for apps in (1); calculates a speedup of 1.0 for apps in (3) (since `fraction_supported` = 0); assigns a default speedup of 1.0 for apps in (2) |
This PR adds the fix for only (2), only (3), and the combo (2) + (3), so that we have a default speedup of 1.0 in all cases.
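The backfilling behavior in the tables above can be sketched roughly as follows (function and variable names here are illustrative, not the actual qualx API):

```python
def backfill_default_speedups(app_ids, predictions, default=1.0):
    """Return a speedup for every app: keep the QualX prediction where one
    exists, otherwise fall back to the default of 1.0 (no speedup)."""
    return {app_id: predictions.get(app_id, default) for app_id in app_ids}

# Apps of type (1) get real predictions; apps of type (2)/(3) default to 1.0.
speedups = backfill_default_speedups(
    ['app-1', 'app-2', 'app-3'],
    {'app-1': 2.4},
)
# speedups == {'app-1': 2.4, 'app-2': 1.0, 'app-3': 1.0}
```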
@leewyang For the future, here are my thoughts:
- Instead of early returns (on error or an empty df), we should modify qualx so that it returns a speedup for all apps.
- If we have this guarantee, then there should be no cases of fallbacks being missed.
I think we were already backfilling any missing predictions w/ 1.0 defaults, except that in some (presumably rare) cases, we were returning empty dataframes to highlight major/unexpected issues. Theoretically, we could just return the default_preds_df for those cases now.
@leewyang There are many internal functions in QualX that raise ValueError, so we will have to keep a try-except in the qual tool to backfill missing predictions in any case.
> How does the code change affect the most common scenario of having mix-match

In the case of mix-match, QualX currently backfills the null/NA values. This PR does not affect mix-match cases; it adds handling only for the case when no result was generated from QualX.
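The try-except guard discussed above could look roughly like this; the names `run_qualx` and `predict_with_fallback` are placeholders for illustration, not the tool's actual functions:

```python
def predict_with_fallback(app_ids, run_qualx):
    """Guard around a QualX-style prediction call: internal helpers may
    raise ValueError (e.g. on an empty preprocessed dataset), so catch it
    and backfill every app with the default speedup of 1.0."""
    try:
        preds = run_qualx(app_ids)
    except ValueError:
        preds = {}  # no predictions were generated; fall back entirely
    # backfill any apps missing from the predictions
    return {app_id: preds.get(app_id, 1.0) for app_id in app_ids}

def failing_model(app_ids):
    # simulates QualX aborting during preprocessing
    raise ValueError('Data preprocessing resulted in an empty dataset.')

result = predict_with_fallback(['app-1', 'app-2'], failing_model)
# result == {'app-1': 1.0, 'app-2': 1.0}
```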
Signed-off-by: Partho Sarthi <[email protected]>
The change in error handling is because the behavior of the tool changed. Previously, the tool included legacy "Speedup" and "Duration" columns, allowing it to continue even if QualX failed. However, since these columns are no longer present, the tool cannot proceed if adding these columns from QualX (predicted or default) fails.
As a follow up of #1318, this PR fixes the fallbacks from QualX by adding the estimated speedup and duration columns with default values.
Additionally, if any errors occur while creating these columns, the tool’s execution will be stopped, as it relies on speedup values to proceed.
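As a rough illustration of filling in the default columns (the column names below are assumptions for the sketch, not necessarily the tool's exact output schema):

```python
def add_default_estimates(rows):
    """For each per-app result row, ensure the estimated speedup column
    exists (defaulting to 1.0) and the estimated duration column exists
    (defaulting to the app's original duration, i.e. no improvement).
    Existing predicted values are left untouched."""
    for row in rows:
        row.setdefault('Estimated GPU Speedup', 1.0)
        row.setdefault('Estimated GPU Duration', row['App Duration'])
    return rows

rows = add_default_estimates([
    {'App Duration': 1000},  # QualX produced no prediction for this app
    {'App Duration': 900, 'Estimated GPU Speedup': 3.0,
     'Estimated GPU Duration': 300},  # existing prediction is kept
])
```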
Docs
No changes needed in documentation, as this is an internal change.
Test
Tested manually and added a test case to the E2E tests.