Update error handling in python for parsing cluster information #1394

parthosa · 2024-10-24T17:47:31Z

Fixes #1392.

The Python code parses the cluster information generated by Scala, displays it on STDOUT, and stores it in app_metadata.json.

This PR ensures that new changes on the Scala side are properly handled by Python.

Error Handling Improvements:

Added checks that log an error:

If the number of worker nodes is invalid.
If the total cores per node is invalid for on-prem platforms.
If the worker instance type is not set and the total cores per node is invalid for CSPs.
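The three conditions above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `is_invalid` and `inference_failure_reason` are hypothetical names, the `is_onprem` flag is an assumption about how the platform is identified, and `math.isnan` stands in for the `pd.isna` check used in the actual change so that the sketch needs only the standard library.

```python
import math
from typing import Optional


def is_invalid(value: Optional[float]) -> bool:
    """True when a parsed numeric field is missing, NaN, or not positive."""
    return value is None or math.isnan(value) or value <= 0


def inference_failure_reason(num_worker_nodes: Optional[float],
                             cores_per_node: Optional[float],
                             worker_instance_type: Optional[str],
                             is_onprem: bool) -> Optional[str]:
    """Return a reason string when the cluster cannot be inferred, else None.

    Hypothetical helper: the names and signature are assumptions for illustration.
    """
    if is_invalid(num_worker_nodes):
        return 'Number of worker nodes cannot be determined. See logs for details.'
    if is_onprem:
        # On-prem: the total cores per node must always be valid.
        if is_invalid(cores_per_node):
            return 'Total cores per node cannot be determined. See logs for details.'
    elif not worker_instance_type and is_invalid(cores_per_node):
        # CSP: the cores only matter when no worker instance type is set.
        return 'Total cores per node cannot be determined. See logs for details.'
    return None
```

Returning a reason string instead of logging directly keeps the validation separate from the logging call, so each condition can be checked in isolation.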

Test

Tested on the original event logs that caused this error.
For this set of event logs, the Scala side could not generate cluster information and logged the following message:

Qualification-0 WARN QualificationAppInfo: Could not determine if any executors were allocated or the number of cores used per executor. Can't build existing cluster information!

However, the Python side did not handle this case properly and showed the following logs:

ERROR rapids.tools.cluster_inference: Error while inferring cluster: cannot convert float NaN to integer
INFO rapids.tools.cluster_inference: For App ID: application_1717064107684_0009, Unable to infer CPU cluster. Reason - No matching worker node found for num cores = -4.0
INFO rapids.tools.cluster_inference: For App ID: application_1717064107684_0008, Unable to infer CPU cluster. Reason - No matching worker node found for num cores = -4.0
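The text of the first ERROR line is the message Python itself raises when `int()` is given a NaN float. The snippet below reproduces that message and shows one way to guard against it. It is only a sketch: `to_int_or_none` is a hypothetical helper, not a function from this PR.

```python
import math
from typing import Optional


def to_int_or_none(value: float) -> Optional[int]:
    """Convert a float to int, returning None for NaN instead of raising."""
    if math.isnan(value):
        return None
    return int(value)


# Reproduce the unguarded failure: int() rejects NaN with a ValueError.
try:
    int(float('nan'))
except ValueError as err:
    message = str(err)  # "cannot convert float NaN to integer"
```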

After this change, the Python logs align with the behavior of the Scala side:

INFO rapids.tools.cluster_inference: For App ID: application_1666141048720_0001, Unable to infer CPU cluster. Reason - Number of worker nodes cannot be determined. See logs for details.
INFO rapids.tools.cluster_inference: For App ID: application_1717064107684_0009, Unable to infer CPU cluster. Reason - Total cores per node cannot be determined. See logs for details.
INFO rapids.tools.cluster_inference: For App ID: application_1717064107684_0008, Unable to infer CPU cluster. Reason - Total cores per node cannot be determined. See logs for details.
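For reference, log lines of this shape could come from a helper along these lines. The diff in this PR calls `self._log_inference_failure(app_id, ...)`, but its body is not shown here, so the functions below are an assumed sketch; only the logger name and message layout are taken from the output above.

```python
import logging

# Logger name copied from the log output; how it is configured is an assumption.
logger = logging.getLogger('rapids.tools.cluster_inference')


def format_inference_failure(app_id: str, reason: str) -> str:
    """Build the message in the layout seen in the logs above."""
    return f'For App ID: {app_id}, Unable to infer CPU cluster. Reason - {reason}'


def log_inference_failure(app_id: str, reason: str) -> None:
    """Hypothetical stand-in for the `_log_inference_failure` method."""
    logger.info(format_inference_failure(app_id, reason))
```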

Signed-off-by: Partho Sarthi <[email protected]>

cindyyuanjiang

Thanks @parthosa!

tgravescs · 2024-10-25T15:06:30Z

user_tools/src/spark_rapids_pytools/common/cluster_inference.py

+        # If number of worker nodes is invalid, log error and return
+        if pd.isna(num_worker_nodes) or num_worker_nodes <= 0:
+            self._log_inference_failure(app_id, 'Number of worker nodes cannot be determined. '
+                                                'See logs for details.')


My only comment here: which logs do you want the user to look at? Are there actual logs that tell the user something useful? If not, I would just leave "See logs for details" off.
For instance, some of this is expected for on-prem, where we don't want to tell them the number since dynamic allocation may be on.

Specifically, I want users to refer to the Scala logs, since they would have the specific reason.

For example, in this case the Scala logs would have:

Qualification-0 WARN QualificationAppInfo: Could not determine if any executors were allocated or the number of cores used per executor. Can't build existing cluster information!

However, from a user's perspective, the Scala and Python logs appear together on the console. I don't think we can distinguish between them and point users to the Scala logs specifically.

Handle invalid values for num worker nodes and num execs per node

6094081

Signed-off-by: Partho Sarthi <[email protected]>

parthosa added the bug and user_tools labels Oct 24, 2024

parthosa self-assigned this Oct 24, 2024

Fix pylint

028f36d

Signed-off-by: Partho Sarthi <[email protected]>

parthosa requested a review from tgravescs October 24, 2024 19:55

parthosa marked this pull request as ready for review October 24, 2024 19:56

cindyyuanjiang self-requested a review October 24, 2024 23:32

cindyyuanjiang approved these changes Oct 24, 2024

tgravescs reviewed Oct 25, 2024

parthosa requested a review from tgravescs October 25, 2024 17:16