[BUG] Error while inferring cluster when cluster properties file is provided to the CLI #1392

Open
parthosa opened this issue Oct 23, 2024 · 3 comments · May be fixed by #1394
Labels
bug Something isn't working user_tools Scope the wrapper module running CSP, QualX, and reports (python)

Comments

@parthosa
Collaborator

parthosa commented Oct 23, 2024

Describe the bug
When running the Tools CLI on Dataproc event logs with a custom cluster info file, an error occurs in the cluster inference module. This is probably caused by recent changes to the cluster information on the Scala side that are incompatible with the Python implementation.

Steps/Code to reproduce bug

spark_rapids qualification --platform dataproc --cluster test-cluster-info.json --eventlogs gs://spark-events --tools_jar <tools_jar> --verbose 

Cluster Info File: test-cluster-info.json

Output

2024-10-23 13:43:38,313 ERROR rapids.tools.cluster_inference: Error while inferring cluster: cannot convert float NaN to integer
2024-10-23 13:43:38,313 INFO rapids.tools.cluster_inference: For App ID: application_1717064107684_0009, Unable to infer CPU cluster. Reason - No matching worker node found for num cores = -4.0
2024-10-23 13:43:38,314 INFO rapids.tools.cluster_inference: For App ID: application_1717064107684_0008, Unable to infer CPU cluster. Reason - No matching worker node found for num cores = -4.0
2024-10-23 13:43:38,314 INFO rapids.tools.cluster_inference: For App ID: application_1717064107684_0007, Unable to infer CPU cluster. Reason - No matching worker node found for num cores = -4.0
2024-10-23 13:43:38,314 INFO rapids.tools.cluster_inference: For App ID: application_1717064107684_0006, Unable to infer CPU cluster. Reason - No matching worker node found for num cores = -4.0
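
For reference, the first error message above is the one Python itself raises when `int()` is applied to a NaN float. The sketch below is independent of the tool's code: it only reproduces the message and shows one possible way to avoid it. The helper `to_int_or_none` is a hypothetical name used for illustration, and where the NaN originates inside the cluster inference module is not established in this issue.

```python
import math


def to_int_or_none(value):
    """Return int(value), or None when the value is missing (None or NaN).

    Hypothetical helper for illustration; not part of the tools code base.
    """
    if value is None or math.isnan(value):
        return None
    return int(value)


# Reproduce the error text from the log: int() cannot represent NaN.
try:
    int(float("nan"))
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
print(to_int_or_none(float("nan")), to_int_or_none(4.0))
```

Checking for NaN before the conversion turns the hard failure into a value the caller can handle explicitly.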
@parthosa parthosa added the "? - Needs Triage", "bug" (Something isn't working), and "user_tools" (Scope the wrapper module running CSP, QualX, and reports (python)) labels Oct 23, 2024
@amahussein amahussein changed the title from "[BUG] Cluster Inference: Error while inferring cluster: cannot convert float NaN to integer." to "[BUG] Error while inferring cluster when cluster properties file is provided to the CLI" Oct 23, 2024
@parthosa parthosa self-assigned this Oct 23, 2024
@cindyyuanjiang
Collaborator

Does the negative num cores = -4.0 come from the Scala output? If not, where does this number come from?

@parthosa
Collaborator Author

parthosa commented Oct 25, 2024

Total Cores is calculated as:

cores_per_executor = cluster_info_df.get('Cores Per Executor')
execs_per_node = cluster_info_df.get('Num Executors Per Node')
total_cores_per_node = execs_per_node * cores_per_executor

The Scala output reports Num Executors Per Node = -1. Multiplied by Cores Per Executor (4, judging by the log above), this gives Total Cores = -4.
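
The arithmetic above can be sketched as follows. This is a minimal illustration and not the change proposed in #1394: the value 4 for Cores Per Executor is inferred from the `-4.0` in the log, and `total_cores_per_node` is a hypothetical guard that treats missing or non-positive inputs (such as the `-1` in the Scala output) as "unknown" instead of multiplying them through.

```python
import math
from typing import Optional


def total_cores_per_node(
    cores_per_executor: Optional[float],
    execs_per_node: Optional[float],
) -> Optional[float]:
    """Return total cores per node, or None if either input is unusable.

    Hypothetical guard: None, NaN, and non-positive values are all
    treated as "unknown" so they never reach the node-matching step.
    """
    for value in (cores_per_executor, execs_per_node):
        if value is None or math.isnan(value) or value <= 0:
            return None
    return execs_per_node * cores_per_executor


# Unguarded multiplication, as in the snippet above (assumed inputs).
naive_total = -1 * 4.0

# Guarded version: the -1 sentinel is reported as unknown.
guarded_total = total_cores_per_node(4.0, -1)

print(naive_total, guarded_total)
```

With a guard like this, the caller can skip inference or fall back to another source when the result is `None`, rather than searching for a worker node with a negative core count.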

@cindyyuanjiang
Collaborator

Scala output generates Num Executors Per Node = -1 causing Total Cores to be -4

Thanks for clarifying!
