
Estimate cluster instances and generate cost savings #803

Merged: 11 commits into NVIDIA:dev on Feb 27, 2024

Conversation

@parthosa (Collaborator) commented Feb 23, 2024

This PR generates cost-savings estimates in the user tools by inferring the cluster shape (instance types and node count). It uses the cluster information JSON generated by the qualification core tool.

Usage and Output for EMR Platform:

CMD:

spark_rapids_user_tools emr qualification -t $TOOLS_JAR --eventlogs $HOME/Work/event-logs/emr-cpu
  1. Cluster information JSON (from core tools):
[ {
  "appName" : "NDS - Power Run",
  "appId" : "application_1692643187882_0001",
  "eventLogPath" : "emr-cpu/application_1692643187882_0001",
  "clusterInfo" : {
    "coresPerExecutor" : 16,
    "numExecutorNodes" : 8
  }
} ]
  2. Inferred cluster (from user tools):
INFO rapids.tools.qualification: Inferred Cluster => Driver: i3.2xlarge, Executor: 8 X m5d.4xlarge
  3. Cost savings:
Report Summary:
------------------------------  ------
Total applications                   1
RAPIDS candidates                    1
Overall estimated speedup         1.98
Overall estimated cost savings  35.46%
------------------------------  ------

Instance types conversions:
-----------  --  ------------
m5d.4xlarge  to  g4dn.4xlarge
-----------  --  ------------
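For intuition, a savings percentage like the 35.46% above can be thought of as comparing the CPU cluster's cost against the GPU cluster's cost over its shorter, accelerated runtime. The sketch below is a hypothetical illustration only; the hourly rates and the exact formula are assumptions, not the tool's actual pricing logic.

```python
def estimated_savings(cpu_rate: float, gpu_rate: float,
                      duration_hours: float, speedup: float) -> float:
    """Rough savings model: GPU run costs more per hour but finishes faster.

    cpu_rate / gpu_rate: total hourly cost of each cluster (assumed values).
    speedup: estimated speedup factor from the qualification tool.
    """
    cpu_cost = cpu_rate * duration_hours
    gpu_cost = gpu_rate * (duration_hours / speedup)  # shorter GPU runtime
    return (1.0 - gpu_cost / cpu_cost) * 100.0

# Illustrative only: 8 executors at made-up hourly rates, 1.98x speedup.
print(round(estimated_savings(cpu_rate=8 * 0.904, gpu_rate=8 * 1.204,
                              duration_hours=1.0, speedup=1.98), 2))
```

With a GPU cluster that costs more per hour but roughly halves the runtime, the model yields a positive savings percentage; the real tool derives rates from per-platform pricing catalogs.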

Output for other platforms can be found here - Link

Estimation of Cluster Shape:

  1. For EMR and Dataproc: store a map of cores -> instance type. Using the coresPerExecutor value from the core tools output, look up the matching instance type.
  2. For Databricks: instance types are already present in the JSON output, so use them directly.
  3. For each platform:
    • Store a default JSON that mirrors the output of describing a cluster (e.g., for EMR: aws emr describe-cluster --cluster-id <id>).
    • Populate this JSON from the cluster shape detected above.
    • Create a CPU cluster instance object.
    • The rest of the flow remains the same.
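The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual user_tools code; the map values and the `executorInstance` key for Databricks are assumptions (only the 16-core -> m5d.4xlarge EMR mapping is taken from the example output above).

```python
# Illustrative cores -> instance-type tables; real tables live in the
# per-platform configuration of user_tools, these values are assumptions.
CORES_TO_INSTANCE = {
    "emr": {8: "m5d.2xlarge", 16: "m5d.4xlarge", 32: "m5d.8xlarge"},
    "dataproc": {8: "n1-standard-8", 16: "n1-standard-16"},
}

def infer_executor_instance(platform: str, cluster_info: dict) -> str:
    """Infer the executor instance type from core-tools cluster info."""
    if platform == "databricks":
        # Databricks output already carries the instance type directly
        # ("executorInstance" is a hypothetical key name here).
        return cluster_info["executorInstance"]
    # EMR/Dataproc: map coresPerExecutor to an instance type.
    return CORES_TO_INSTANCE[platform][cluster_info["coresPerExecutor"]]

info = {"coresPerExecutor": 16, "numExecutorNodes": 8}
print(infer_executor_instance("emr", info))  # m5d.4xlarge
```

The inferred shape is then rendered into the platform's default cluster-description JSON so the existing cost-estimation flow can consume it unchanged.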

Limitation:

Currently, only single-cluster estimation is supported in user tools. Supporting multiple clusters would require deeper design changes in user tools. We inform the user of this limitation via logging.

@parthosa parthosa added feature request New feature or request user_tools Scope the wrapper module running CSP, QualX, and reports (python) labels Feb 23, 2024
@parthosa parthosa self-assigned this Feb 23, 2024
Signed-off-by: Partho Sarthi <[email protected]>
@mattahrens (Collaborator)

To confirm -- does this also work with the new CLI spark_rapids? Your example shows legacy command spark_rapids_user_tools and I want to make sure it works fine with new CLI.

@parthosa (Collaborator, Author)

> To confirm -- does this also work with the new CLI spark_rapids? Your example shows legacy command spark_rapids_user_tools and I want to make sure it works fine with new CLI.

Yes, it works with the new CLI. Since 24.02 is not released yet, the new CLI fails due to some of the changes made during this release; hence I am using the old CLI and passing my local jar.

@amahussein (Collaborator)

> Currently, only single-cluster estimation is supported in user tools. Supporting multiple clusters would require deeper design changes in user tools. We inform the user of this limitation via logging.

The feature is not really complete until we make that change in user_tools. With the current changes we can get away with it when we process one event log at a time.

@amahussein (Collaborator) left a comment

Let's hold off a little to see what the best strategy is to move forward. We can use this PR for a quick POC for a single app. However, I think we cannot fully merge it until we change the entire code to support a CPU/GPU cluster per app.

Signed-off-by: Partho Sarthi <[email protected]>
@@ -103,6 +104,14 @@ def get_supported_gpus(self) -> dict:
        gpu_scopes[mc_prof] = NodeHWInfo(sys_info=hw_info_ob, gpu_info=gpu_info_obj)
    return gpu_scopes

def generate_cluster_configuration(self, render_args: dict):
    executor_names = ','.join([
        f'{{"node_id": "12345678900{i}"}}'
Collaborator:
Is there a reason why using prefix 12345678900?

@parthosa (Collaborator, Author), Feb 27, 2024:

This is chosen to mimic the pattern of an actual cluster.
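For context, the padded numeric prefix discussed here produces placeholder node IDs that look like real cluster node IDs. A self-contained sketch of the pattern in the diff (the helper name is hypothetical; only the f-string expression is from the PR):

```python
def make_executor_entries(num_executors: int) -> str:
    """Build a comma-separated list of synthetic executor node entries.

    Doubled braces ({{ and }}) emit literal braces in the f-string, so each
    entry is a JSON-like object with a node_id such as "123456789000".
    """
    return ','.join(
        f'{{"node_id": "12345678900{i}"}}' for i in range(num_executors)
    )

print(make_executor_entries(2))
# {"node_id": "123456789000"},{"node_id": "123456789001"}
```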

@amahussein (Collaborator) left a comment

@mattahrens

Are you okay with the PR getting merged?
The limitation is that it only works if user-tools runs on a single event log.
I opened a new issue #809 to do the remaining changes in user-tools.

@mattahrens (Collaborator)

> Are you okay with the PR getting merged? The limitation is that it only works if user-tools runs on a single event log. I opened a new issue #809 to do the remaining changes in user-tools.

Yes, I'm fine with it going in as it enables new functionality for single event-log runs.

@amahussein (Collaborator) left a comment

LGTM

@amahussein amahussein merged commit 6aa7cd8 into NVIDIA:dev Feb 27, 2024
13 checks passed
@parthosa parthosa deleted the spark-rapids-tools-581-user-tools branch February 29, 2024 06:51
Successfully merging this pull request may close these issues.

[FEA] Qualification tool can infer the CPU jobs' cluster shape and then provide the suggestion based on that