
Estimate cluster instances and generate cost savings #803

Merged: 11 commits into NVIDIA:dev on Feb 27, 2024

Conversation

@parthosa (Collaborator) commented Feb 23, 2024

This PR generates cost-savings estimates in the user tools by inferring the cluster shape (instance types and node count). It uses the cluster information JSON generated by the qualification core tool.

Usage and Output for EMR Platform:

CMD:

spark_rapids_user_tools emr qualification -t $TOOLS_JAR --eventlogs $HOME/Work/event-logs/emr-cpu
  1. Cluster information JSON (from core tools):
[ {
  "appName" : "NDS - Power Run",
  "appId" : "application_1692643187882_0001",
  "eventLogPath" : "emr-cpu/application_1692643187882_0001",
  "clusterInfo" : {
    "coresPerExecutor" : 16,
    "numExecutorNodes" : 8
  }
} ]
  2. Inferred cluster (from user tools):
INFO rapids.tools.qualification: Inferred Cluster => Driver: i3.2xlarge, Executor: 8 X m5d.4xlarge
  3. Cost savings:
Report Summary:
------------------------------  ------
Total applications                   1
RAPIDS candidates                    1
Overall estimated speedup         1.98
Overall estimated cost savings  35.46%
------------------------------  ------

Instance types conversions:
-----------  --  ------------
m5d.4xlarge  to  g4dn.4xlarge
-----------  --  ------------
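For intuition, a savings percentage like the 35.46% above can be thought of as comparing the CPU cluster's cost against the GPU cluster's cost over its shorter, accelerated runtime. The sketch below is a hypothetical illustration only; the hourly rates and the exact formula are assumptions, not the tool's actual pricing logic.

```python
def estimated_savings(cpu_rate: float, gpu_rate: float,
                      duration_hours: float, speedup: float) -> float:
    """Rough savings model: GPU run costs more per hour but finishes faster.

    cpu_rate / gpu_rate: total hourly cost of each cluster (assumed values).
    speedup: estimated speedup factor from the qualification tool.
    """
    cpu_cost = cpu_rate * duration_hours
    gpu_cost = gpu_rate * (duration_hours / speedup)  # shorter GPU runtime
    return (1.0 - gpu_cost / cpu_cost) * 100.0

# Illustrative only: 8 executors at made-up hourly rates, 1.98x speedup.
print(round(estimated_savings(cpu_rate=8 * 0.904, gpu_rate=8 * 1.204,
                              duration_hours=1.0, speedup=1.98), 2))
```

With a GPU cluster that costs more per hour but roughly halves the runtime, the model yields a positive savings percentage; the real tool derives rates from per-platform pricing catalogs.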

Output for other platforms can be found here - Link

Estimation of Cluster Shape:

  1. For EMR and Dataproc: store a map of cores -> instance type. Using the coresPerExecutor value from the core tools output, look up the matching instance type.
  2. For Databricks: instance types are already present in the JSON output, so use them directly.
  3. For each platform:
    • Store a default JSON that mirrors the output of describing a cluster (e.g., for EMR: aws emr describe-cluster --cluster-id <id>).
    • Populate this JSON from the cluster shape detected above.
    • Create a CPU cluster instance object.
    • The rest of the flow remains the same.
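The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual user_tools code; the map values and the `executorInstance` key for Databricks are assumptions (only the 16-core -> m5d.4xlarge EMR mapping is taken from the example output above).

```python
# Illustrative cores -> instance-type tables; real tables live in the
# per-platform configuration of user_tools, these values are assumptions.
CORES_TO_INSTANCE = {
    "emr": {8: "m5d.2xlarge", 16: "m5d.4xlarge", 32: "m5d.8xlarge"},
    "dataproc": {8: "n1-standard-8", 16: "n1-standard-16"},
}

def infer_executor_instance(platform: str, cluster_info: dict) -> str:
    """Infer the executor instance type from core-tools cluster info."""
    if platform == "databricks":
        # Databricks output already carries the instance type directly
        # ("executorInstance" is a hypothetical key name here).
        return cluster_info["executorInstance"]
    # EMR/Dataproc: map coresPerExecutor to an instance type.
    return CORES_TO_INSTANCE[platform][cluster_info["coresPerExecutor"]]

info = {"coresPerExecutor": 16, "numExecutorNodes": 8}
print(infer_executor_instance("emr", info))  # m5d.4xlarge
```

The inferred shape is then rendered into the platform's default cluster-description JSON so the existing cost-estimation flow can consume it unchanged.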

Limitation:

Currently, only single-cluster estimation is supported in user tools. Supporting multiple clusters would require deeper design changes in user tools. We inform the user of this limitation via logging.

@parthosa parthosa added feature request New feature or request user_tools Scope the wrapper module running CSP, QualX, and reports (python) labels Feb 23, 2024
@parthosa parthosa self-assigned this Feb 23, 2024
Signed-off-by: Partho Sarthi <[email protected]>
@mattahrens (Collaborator)

To confirm -- does this also work with the new CLI spark_rapids? Your example shows legacy command spark_rapids_user_tools and I want to make sure it works fine with new CLI.

@parthosa (Collaborator, Author)

> To confirm -- does this also work with the new CLI spark_rapids? Your example shows legacy command spark_rapids_user_tools and I want to make sure it works fine with new CLI.

Yes, it works with the new CLI. Since 24.02 is not released yet, the new CLI fails due to some of the changes made during this release; hence I am using the old CLI and passing my local jar.

@amahussein (Collaborator)

> Currently, only single-cluster estimation is supported in user tools. Supporting multiple clusters would require deeper design changes in user tools. We inform the user of this limitation via logging.

The feature is not really complete until we make that change in user_tools. With the current changes we can get away with it when we process one event log at a time.

@amahussein (Collaborator) left a comment

Let's hold off a little to see what the best strategy is to move forward. We can use this PR for a quick POC for a single app. However, I think we cannot fully merge it until we change the entire code to support a CPU/GPU cluster per app.

Signed-off-by: Partho Sarthi <[email protected]>
@@ -103,6 +104,14 @@ def get_supported_gpus(self) -> dict:
        gpu_scopes[mc_prof] = NodeHWInfo(sys_info=hw_info_ob, gpu_info=gpu_info_obj)
    return gpu_scopes

def generate_cluster_configuration(self, render_args: dict):
    executor_names = ','.join([
        f'{{"node_id": "12345678900{i}"}}'
Collaborator:
Is there a reason why using prefix 12345678900?

@parthosa (Collaborator, Author), Feb 27, 2024:

This is chosen to mimic the pattern of an actual cluster.
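For context, the padded numeric prefix discussed here produces placeholder node IDs that look like real cluster node IDs. A self-contained sketch of the pattern in the diff (the helper name is hypothetical; only the f-string expression is from the PR):

```python
def make_executor_entries(num_executors: int) -> str:
    """Build a comma-separated list of synthetic executor node entries.

    Doubled braces ({{ and }}) emit literal braces in the f-string, so each
    entry is a JSON-like object with a node_id such as "123456789000".
    """
    return ','.join(
        f'{{"node_id": "12345678900{i}"}}' for i in range(num_executors)
    )

print(make_executor_entries(2))
# {"node_id": "123456789000"},{"node_id": "123456789001"}
```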

@amahussein (Collaborator) left a comment

@mattahrens

Are you okay with the PR getting merged?
The limitation is that it only works if user-tools runs on a single event log.
I opened a new issue #809 to do the remaining changes in user-tools.

@mattahrens (Collaborator)

> Are you okay with the PR getting merged? The limitation is that it only works if user-tools runs on a single event log. I opened a new issue #809 to do the remaining changes in user-tools.

Yes, I'm fine with it going in as it enables new functionality for single event-log runs.

@amahussein (Collaborator) left a comment

LGTM

@amahussein amahussein merged commit 6aa7cd8 into NVIDIA:dev Feb 27, 2024
13 checks passed
@parthosa parthosa deleted the spark-rapids-tools-581-user-tools branch February 29, 2024 06:51
Successfully merging this pull request may close these issues.

[FEA] Qualification tool can infer the CPU jobs' cluster shape and then provide the suggestion based on that