Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update error handling in python for parsing cluster information #1394

Open
wants to merge 2 commits into
base: dev
Choose a base branch
from

Conversation

parthosa
Copy link
Collaborator

@parthosa parthosa commented Oct 24, 2024

Fixes #1392.

The Python code parses cluster information generated by Scala for display on STDOUT and store in app_metadata.json.

This PR ensures that new changes on the Scala side are properly handled by Python.

Error Handling Improvements:

Added a check to log an error:

  • If the number of worker nodes is invalid.
  • If the total cores per node is invalid for on-prem platforms.
  • If worker instance type is not set and the total cores per node is invalid for CSPs

Test

  • Tested on the original event logs that had caused this error.
  • In the given set of event logs, Scala side could not generate cluster information and would log the error message as follows:
Qualification-0 WARN QualificationAppInfo: Could not determine if any executors were allocated or the number of cores used per executor. Can't build existing cluster information!
  • However, the Python side would not handle this case properly and show the logs as:
ERROR rapids.tools.cluster_inference: Error while inferring cluster: cannot convert float NaN to integer
INFO rapids.tools.cluster_inference: For App ID: application_1717064107684_0009, Unable to infer CPU cluster. Reason - No matching worker node found for num cores = -4.0
INFO rapids.tools.cluster_inference: For App ID: application_1717064107684_0008, Unable to infer CPU cluster. Reason - No matching worker node found for num cores = -4.0
  • After this change, the logs in Python align with the behavior from Scala side.
INFO rapids.tools.cluster_inference: For App ID: application_1666141048720_0001, Unable to infer CPU cluster. Reason - Number of worker nodes cannot be determined. See logs for details.
INFO rapids.tools.cluster_inference: For App ID: application_1717064107684_0009, Unable to infer CPU cluster. Reason - Total cores per node cannot be determined. See logs for details.
INFO rapids.tools.cluster_inference: For App ID: application_1717064107684_0008, Unable to infer CPU cluster. Reason - Total cores per node cannot be determined. See logs for details.

@parthosa parthosa added bug Something isn't working user_tools Scope the wrapper module running CSP, QualX, and reports (python) labels Oct 24, 2024
@parthosa parthosa self-assigned this Oct 24, 2024
Signed-off-by: Partho Sarthi <[email protected]>
@parthosa parthosa marked this pull request as ready for review October 24, 2024 19:56
Copy link
Collaborator

@cindyyuanjiang cindyyuanjiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @parthosa!

# If number of worker nodes is invalid, log error and return
if pd.isna(num_worker_nodes) or num_worker_nodes <= 0:
self._log_inference_failure(app_id, 'Number of worker nodes cannot be determined. '
'See logs for details.')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my only comment here is what logs are you wanting the user to look at? Is there actual logs that are useful for user to know something... if not I would just leave "See logs for details" off.
for instance, Some of these is expected for like onprem where we don't want to tell them the number since dynamic allocation may be on.

Copy link
Collaborator Author

@parthosa parthosa Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically, I want the users to refer to Scala logs since it would have the specific reason.

For example in this case, Scala logs would have

Qualification-0 WARN QualificationAppInfo: Could not determine if any executors were allocated or the number of cores used per executor. Can't build existing cluster information!

However, from a user perspective, they would see both logs from Scala and Python on the console together. I dont think we can distinguish between these and point them to Scala logs specifically.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working user_tools Scope the wrapper module running CSP, QualX, and reports (python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Error while inferring cluster when cluster properties file is provided to the CLI
3 participants