Estimate cluster instances and generate cost savings #803
Signed-off-by: Partho Sarthi <[email protected]>
To confirm -- does this also work with the new CLI?
Yes, it works with the new CLI. Since 24.02 is not released yet, the new CLI fails because of some of the changes made during this release. Hence I am using the old CLI and passing my local jar.
The feature is not really complete until we make that change in the |
Let's hold off a little to see what the best strategy is for moving forward.
We can use this PR as a quick POC for a single app. However, I don't think we can fully merge it until we change the entire code base to support a CPU/GPU cluster per app.
user_tools/src/spark_rapids_pytools/cloud_api/databricks_azure.py
user_tools/src/spark_rapids_pytools/resources/dataproc-configs.json
user_tools/src/spark_rapids_pytools/resources/templates/cluster_template/emr.ms
user_tools/src/spark_rapids_pytools/resources/templates/cluster_template/dataproc.ms
@@ -103,6 +104,14 @@ def get_supported_gpus(self) -> dict:
             gpu_scopes[mc_prof] = NodeHWInfo(sys_info=hw_info_ob, gpu_info=gpu_info_obj)
         return gpu_scopes

     def generate_cluster_configuration(self, render_args: dict):
         executor_names = ','.join([
             f'{{"node_id": "12345678900{i}"}}'
Is there a reason for using the prefix 12345678900?
This is chosen to mimic the pattern of an actual cluster.
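For readers following the thread, here is a minimal sketch of how such a list of executor entries could be built and checked. The diff above is truncated, so the loop bound, the `num_executors` parameter, and the helper name `build_executor_nodes` are assumptions made for illustration, not the PR's actual code.

```python
import json


def build_executor_nodes(num_executors: int, prefix: str = "12345678900") -> str:
    """Return a comma-separated string of JSON objects, one per executor node.

    Hypothetical helper: `num_executors` and the iteration range are assumed,
    because the diff hunk in this thread is cut off before the loop.
    """
    return ','.join(
        f'{{"node_id": "{prefix}{i}"}}'  # doubled braces emit literal { and }
        for i in range(num_executors)
    )


# Wrapping the fragment in [] makes it valid JSON, which is a quick way to
# confirm the rendered template input is well-formed.
nodes_fragment = build_executor_nodes(3)
parsed_nodes = json.loads(f'[{nodes_fragment}]')
```

Parsing the fragment back with `json.loads` is only a sanity check; in the PR the string is presumably passed to the cluster template as a render argument.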
Are you okay with the PR getting merged?
The limitation is that it only works if user-tools runs on a single eventlog.
I opened a new issue #809 to do the remaining changes in user-tools.
Yes, I'm fine with it going in, as it does enable new functionality for single event-log runs.
LGTM
This PR generates cost savings in user tools by estimating the cluster shape (instance type and count). It uses the cluster information JSON generated by the qualification core tool.
Usage and Output for EMR Platform:
CMD:
Output for other platforms can be found here - Link
Estimation of Cluster Shape:
cores -> instance type. Using the coresPerExecutor value from core tools, we get the instance type.
aws describe-cluster --cluster-id <id>)
Limitation:
Currently, user tools can estimate only a single cluster. Supporting multiple clusters would require deeper design changes in user tools. We inform the user about this limitation in the logs.
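The cores -> instance type step and the single-cluster limitation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the mapping table, the `numExecutorNodes` field name, and the function `estimate_cluster_shape` are hypothetical; only `coresPerExecutor` comes from the description.

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative cores -> instance type map (assumed values for an EMR-like
# platform; the real mapping lives in the tool's platform configuration).
CORES_TO_INSTANCE = {4: "m5.xlarge", 8: "m5.2xlarge", 16: "m5.4xlarge"}


def estimate_cluster_shape(cluster_infos: list) -> dict:
    """Estimate instance type and count from core-tool cluster info records.

    Only the first record is used, mirroring the single-cluster limitation.
    """
    if not cluster_infos:
        raise ValueError("no cluster information available")
    if len(cluster_infos) > 1:
        # Tell the user that the remaining records are ignored.
        logger.warning(
            "Only a single cluster is supported; using the first of %d.",
            len(cluster_infos),
        )
    info = cluster_infos[0]
    cores = info["coresPerExecutor"]
    if cores not in CORES_TO_INSTANCE:
        raise KeyError(f"no instance type known for {cores} cores")
    return {
        "instance_type": CORES_TO_INSTANCE[cores],
        # Assumed field name for the number of executor nodes.
        "num_instances": info["numExecutorNodes"],
    }


shape = estimate_cluster_shape([{"coresPerExecutor": 8, "numExecutorNodes": 4}])
```

A lookup table keeps the mapping easy to audit per platform; a fuller version would need a fallback for core counts that have no exact match.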