Qual tool tuning rec based on CPU event log coherently recommend tunings and node setup and infer cluster from eventlog #1188
Conversation
Signed-off-by: Thomas Graves <[email protected]>
Thanks @tgravescs for these changes. Made an initial pass to test e2e:
- Currently, the python tool generates a file `intermediate_output/cluster_shape_recommendation.json` to store the final cluster shape recommendation. This PR does not seem to update that JSON file correctly.
- After this PR, we would potentially have two JSON files with the cluster shape recommendation (one generated by the scala side and one by the python side). Users might get confused as to which file to refer to.
- In the console output, should the column `Qualified Node Recommendation` be only node or cluster shape?
- nit: Can we put "Estimated GPU Speedup Category assumes the user is using the node..." in the Notes section:

      Notes:
      --------------------
      - Apps with the same name are grouped together and their metrics are averaged
      - 'Estimated GPU Speedup Category' assumes the user is using the node type recommended and config recommendations with the same size cluster as was used with the CPU side.
```python
conversion_items_summary = {}
if self.ctxt.get_ctxt('cpuClusterProxy'):
    cpu_cluster_info = self.ctxt.get_ctxt('cpuClusterProxy')
```
nit: Can we simplify the nested if conditions by moving `cpu_cluster_info` and `gpu_cluster_info` before the first `if`?
I don't follow, what first `if` statement?
We can probably refactor:

```python
if self.ctxt.get_ctxt('cpuClusterProxy'):
    cpu_cluster_info = self.ctxt.get_ctxt('cpuClusterProxy')
    gpu_cluster_info = self.ctxt.get_ctxt('gpuClusterProxy')
    if cpu_cluster_info is not None and gpu_cluster_info is not None:
```

as:

```python
cpu_cluster_info = self.ctxt.get_ctxt('cpuClusterProxy')
gpu_cluster_info = self.ctxt.get_ctxt('gpuClusterProxy')
if cpu_cluster_info:
    cpu_instance_type = cpu_cluster_info.get_worker_node().instance_type
if gpu_cluster_info:
    ...
```
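For illustration, a runnable sketch of this hoisting suggestion. `StubContext`, `Cluster`, and `Node` are stand-ins for the tool's real classes, used only to show the shape of the change: fetch both proxies once up front, then branch on each independently.

```python
# Stand-in classes (NOT the tool's real ToolContext/cluster API).
class Node:
    def __init__(self, instance_type):
        self.instance_type = instance_type

class Cluster:
    def __init__(self, instance_type):
        self._worker = Node(instance_type)

    def get_worker_node(self):
        return self._worker

class StubContext:
    def __init__(self, entries):
        self._entries = entries

    def get_ctxt(self, key):
        # Mirrors a context lookup that returns None for a missing key.
        return self._entries.get(key)

ctxt = StubContext({'cpuClusterProxy': Cluster('m5.xlarge')})  # no GPU proxy set

# Fetch both proxies once, before any branching.
cpu_cluster_info = ctxt.get_ctxt('cpuClusterProxy')
gpu_cluster_info = ctxt.get_ctxt('gpuClusterProxy')

cpu_instance_type = None
gpu_instance_type = None
if cpu_cluster_info:
    cpu_instance_type = cpu_cluster_info.get_worker_node().instance_type
if gpu_cluster_info:
    gpu_instance_type = gpu_cluster_info.get_worker_node().instance_type

print(cpu_instance_type, gpu_instance_type)  # -> m5.xlarge None
```

Each branch then degrades independently when one of the proxies is missing, instead of requiring both to be present.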
```python
if self.ctxt.get_ctxt('cpuClusterProxy'):
    cpu_cluster_info = self.ctxt.get_ctxt('cpuClusterProxy')
    gpu_cluster_info = self.ctxt.get_ctxt('gpuClusterProxy')
    if cpu_cluster_info is not None and gpu_cluster_info is not None:
```
Can we store `*_cluster_info.get_worker_node().instance_type` as variables to avoid redundant calls and clean up (similar to below)?
updating
```python
@@ -989,13 +1106,44 @@ def __infer_cluster_and_update_savings(self, cluster_info_df: pd.DataFrame):

    # Log the inferred cluster information and set the context
    self._log_inferred_cluster_info(cpu_cluster_obj)
    self.logger.info('Inferred Cluster cpu node %s', cpu_cluster_obj)
```
We might not want to print the entire cluster object here.
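One way to address this would be to log only a short summary rather than the whole object. This is a hedged sketch: the attribute names (`num_workers`, `instance_type`) and the `summarize_cluster` helper are illustrative, not the tool's actual API.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('rapids.tools.qualification')

class InferredCluster:  # stand-in for the real inferred cluster object
    instance_type = 'n1-standard-16'
    num_workers = 4

def summarize_cluster(cluster):
    """Return a compact, human-readable description for log output."""
    return f'{cluster.num_workers} x {cluster.instance_type}'

# Logs a one-line summary instead of the full object's repr.
logger.info('Inferred CPU cluster: %s', summarize_cluster(InferredCluster()))
```

The log line then stays readable even if the cluster object grows large or contains nested configuration.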
removed
user_tools/src/spark_rapids_pytools/resources/qualification-conf.yaml (resolved)
Thanks @tgravescs for the huge effort putting all those pieces together!
I was running the changes and I noticed the following:
- For onPrem it shows `nan`. We might need to clarify that this is on purpose in the PR description. I think it will look better if `nan` is handled and replaced by `N/A`.
- I see log messages that could be due to incorrect handling of variables:

      2024-07-15 16:34:12,444 WARNING rapids.tools.qualification: Failed to get the worker node information for the GPU cluster AttributeError: 'NoneType' object has no attribute 'get_worker_node'
      2024-07-15 16:34:12,444 WARNING rapids.tools.qualification: Cannot generate the cluster recommendation report because the cluster information is not available.
      2024-07-15 16:34:12,444 ERROR rapids.tools.qualification: Error generating the cluster recommendation report. Reason - AttributeError: 'NoneType' object has no attribute 'get_cluster_configuration'

- We previously fixed the flow to avoid requiring CSP authentication to retrieve some node information. It looks like the AutoTuning recommendation did not pick the same path that has that fix. All the node recommendations are `nan` for all CSPs without valid CSP authentication. Perhaps @parthosa can help on closing that gap.
- I was looking into the `intermediate_output/cluster_shape_recommendation.csv`. Both `sourceCluster`/`targetCluster` include gpuInfo set to T4. In fact both clusters have the exact same information for dataproc (attached here: cluster_shape_recommendation.json).
- Mentioning that we need to follow up on this PR by revisiting "Recommended cluster should use executors_per_node and cores_per_executor" #1138 to fix the recommendations for Dataproc environments.
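On the first point above, one lightweight way to replace `nan` with `N/A` in the console output would be a small rendering helper. This is only a sketch; `display_value` is a hypothetical name, not part of the tool today.

```python
import math

def display_value(v):
    """Render a cell for stdout, mapping None/NaN to 'N/A'."""
    if v is None:
        return 'N/A'
    if isinstance(v, float) and math.isnan(v):
        return 'N/A'
    return str(v)

# Hypothetical row from the recommendation table, with a missing value.
row = {'App Name': 'job-1', 'Qualified Node Recommendation': float('nan')}
rendered = {k: display_value(v) for k, v in row.items()}
print(rendered['Qualified Node Recommendation'])  # -> N/A
```

Applying this only at print time keeps the underlying CSV/JSON values untouched while making onPrem rows read cleanly.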
I intentionally didn't modify the intermediate_output/cluster_shape_recommendation.json as it's in the intermediate output, and that can be a followup. In my opinion, if it's in this directory it shouldn't be used. I wanted to stop at some point and just get in what we have so multiple people can work on it. It doesn't change what it is for the worse.

Right now it prints a node recommendation, so the title matches. If we want to print more cluster information, that is an enhancement. This is only in the text stdout, so it doesn't seem like a big deal to me, but I'm open to opinions. Either way it can be a followup, since I see the cluster shape recommendation as a followup thing we can do, as long as it's coherent before release.

What Notes section? I am not familiar with it. If you are suggesting to put it there in addition, I guess it's fine, but how is that different from where it is now? I want it as close to that table as possible.
sure
I can look at the NoneType; honestly there were a lot before. The middle one is explicit and should be there.
Sure let me know.
ok, what is the question or bug here you are pointing out? I was trying to not really touch this file; I may have in modifying some variables.
…Category stdout since we don't want them in csv
> ok, what is the question or bug here you are pointing out? I was trying to not really touch this file; I may have in modifying some variables.

My question here is that the source cluster is a CPU cluster. I don't expect to see GPU information within the source cluster.
user_tools/src/spark_rapids_pytools/resources/templates/cluster_template/dataproc.ms (resolved)
LGTM.
Thanks @tgravescs
Fixes #1160 and #997.
This has a bunch of changes in it. Currently we have basic integration to be able to run the qualification auto tuner against CPU event logs, but it isn't tied to any node recommendations or other features in the python user tools, so users can get incorrect or inconsistent behavior; this fixes that.
Note, the assumption here is that you run a similar sized GPU cluster compared to the CPU cluster. Similar sized to me means the same number of executors with similar cores/memory (might not be exact). If you change the number of executors, then you change the parallelism possibilities.
Here are some of the changes:
There are a bunch of things that still need further enhancement and most of those have issues or are documented in #1152
Tested via existing unit tests, and then I manually tested qualification against databricks aws, emr, databricks azure, dataproc (all of the flavors), and on prem.
Example output running on Dataproc without passing the --cluster option so it does a per app level recommendation:
Same recommendation when you do pass in the --cluster option, which implies all the apps ran on this type of cluster, and it uses that cluster instead of the cluster info from the event log: