Qual tool tuning rec based on CPU event log coherently recommend tunings and node setup and infer cluster from eventlog #1188
Conversation
Signed-off-by: Thomas Graves <[email protected]>
Thanks @tgravescs for these changes. Made an initial pass to test e2e:
- Currently, the python tool generates a file `intermediate_output/cluster_shape_recommendation.json` to store the final cluster shape recommendation. This PR does not seem to update that JSON file correctly.
- After this PR, we would potentially have two JSON files with the cluster shape recommendation (one generated by the scala side and one by the python side). Users might get confused as to which file to refer to.
- In the console output, should the column `Qualified Node Recommendation` be only node or cluster shape?
- nit: Can we put "Estimated GPU Speedup Category assumes the user is using the node..." in the Notes section:

      Notes:
      --------------------
      - Apps with the same name are grouped together and their metrics are averaged
      - 'Estimated GPU Speedup Category' assumes the user is using the node type recommended and config recommendations with the same size cluster as was used with the CPU side.
```python
conversion_items_summary = {}
if self.ctxt.get_ctxt('cpuClusterProxy'):
    cpu_cluster_info = self.ctxt.get_ctxt('cpuClusterProxy')
```
nit: Can we simplify the nested if conditions by moving `cpu_cluster_info` and `gpu_cluster_info` before the first `if`?
I don't follow, what first `if` statement?
We can probably refactor:

```python
if self.ctxt.get_ctxt('cpuClusterProxy'):
    cpu_cluster_info = self.ctxt.get_ctxt('cpuClusterProxy')
    gpu_cluster_info = self.ctxt.get_ctxt('gpuClusterProxy')
    if cpu_cluster_info is not None and gpu_cluster_info is not None:
```

as:

```python
cpu_cluster_info = self.ctxt.get_ctxt('cpuClusterProxy')
gpu_cluster_info = self.ctxt.get_ctxt('gpuClusterProxy')
if cpu_cluster_info:
    cpu_instance_type = cpu_cluster_info.get_worker_node().instance_type
if gpu_cluster_info:
    ...
```
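For illustration, a runnable sketch of this hoisting suggestion. `StubContext`, `Cluster`, and `Node` are stand-ins for the tool's real classes, used only to show the shape of the change: fetch both proxies once up front, then branch on each independently.

```python
# Stand-in classes (NOT the tool's real ToolContext/cluster API).
class Node:
    def __init__(self, instance_type):
        self.instance_type = instance_type

class Cluster:
    def __init__(self, instance_type):
        self._worker = Node(instance_type)

    def get_worker_node(self):
        return self._worker

class StubContext:
    def __init__(self, entries):
        self._entries = entries

    def get_ctxt(self, key):
        # Mirrors a context lookup that returns None for a missing key.
        return self._entries.get(key)

ctxt = StubContext({'cpuClusterProxy': Cluster('m5.xlarge')})  # no GPU proxy set

# Fetch both proxies once, before any branching.
cpu_cluster_info = ctxt.get_ctxt('cpuClusterProxy')
gpu_cluster_info = ctxt.get_ctxt('gpuClusterProxy')

cpu_instance_type = None
gpu_instance_type = None
if cpu_cluster_info:
    cpu_instance_type = cpu_cluster_info.get_worker_node().instance_type
if gpu_cluster_info:
    gpu_instance_type = gpu_cluster_info.get_worker_node().instance_type

print(cpu_instance_type, gpu_instance_type)  # -> m5.xlarge None
```

Each branch then degrades independently when one of the proxies is missing, instead of requiring both to be present.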
```python
if self.ctxt.get_ctxt('cpuClusterProxy'):
    cpu_cluster_info = self.ctxt.get_ctxt('cpuClusterProxy')
    gpu_cluster_info = self.ctxt.get_ctxt('gpuClusterProxy')
    if cpu_cluster_info is not None and gpu_cluster_info is not None:
```
Can we store `*_cluster_info.get_worker_node().instance_type` as variables to avoid redundant calls and clean up (similar to below)?
updating
```python
@@ -989,13 +1106,44 @@ def __infer_cluster_and_update_savings(self, cluster_info_df: pd.DataFrame):

    # Log the inferred cluster information and set the context
    self._log_inferred_cluster_info(cpu_cluster_obj)
    self.logger.info('Inferred Cluster cpu node %s', cpu_cluster_obj)
```
We might not want to print the entire cluster object here.
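One way to address this would be to log only a short summary rather than the whole object. This is a hedged sketch: the attribute names (`num_workers`, `instance_type`) and the `summarize_cluster` helper are illustrative, not the tool's actual API.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('rapids.tools.qualification')

class InferredCluster:  # stand-in for the real inferred cluster object
    instance_type = 'n1-standard-16'
    num_workers = 4

def summarize_cluster(cluster):
    """Return a compact, human-readable description for log output."""
    return f'{cluster.num_workers} x {cluster.instance_type}'

# Logs a one-line summary instead of the full object's repr.
logger.info('Inferred CPU cluster: %s', summarize_cluster(InferredCluster()))
```

The log line then stays readable even if the cluster object grows large or contains nested configuration.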
removed
user_tools/src/spark_rapids_pytools/resources/qualification-conf.yaml (resolved)
Thanks @tgravescs for the huge effort putting all those pieces together!
I was running the changes and I noticed the following:
- For onPrem it shows `nan`. We might need to clarify that this is on purpose in the PR description. I think it will look better if `nan` is handled and replaced by `N/A`.
- I see log messages that could be due to incorrect handling of variables:

      2024-07-15 16:34:12,444 WARNING rapids.tools.qualification: Failed to get the worker node information for the GPU cluster AttributeError: 'NoneType' object has no attribute 'get_worker_node'
      2024-07-15 16:34:12,444 WARNING rapids.tools.qualification: Cannot generate the cluster recommendation report because the cluster information is not available.
      2024-07-15 16:34:12,444 ERROR rapids.tools.qualification: Error generating the cluster recommendation report. Reason - AttributeError: 'NoneType' object has no attribute 'get_cluster_configuration'

- We previously fixed the flow to avoid requiring CSP authentication to retrieve some node information. It looks like the AutoTuning recommendation did not pick the same path that has that fix. All the node recommendations are `nan` for all CSPs without valid CSP authentication. Perhaps @parthosa can help on closing that gap.
- I was looking into the `intermediate_output/cluster_shape_recommendation.csv`. Both `sourceCluster`/`targetCluster` include gpuInfo set to T4. In fact both clusters have the exact same information for dataproc (attached here: cluster_shape_recommendation.json).
- Mentioning that we need to follow up on this PR by revisiting "Recommended cluster should use executors_per_node and cores_per_executor" #1138 to fix the recommendations for Dataproc environments.
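On the first point above, one lightweight way to replace `nan` with `N/A` in the console output would be a small rendering helper. This is only a sketch; `display_value` is a hypothetical name, not part of the tool today.

```python
import math

def display_value(v):
    """Render a cell for stdout, mapping None/NaN to 'N/A'."""
    if v is None:
        return 'N/A'
    if isinstance(v, float) and math.isnan(v):
        return 'N/A'
    return str(v)

# Hypothetical row from the recommendation table, with a missing value.
row = {'App Name': 'job-1', 'Qualified Node Recommendation': float('nan')}
rendered = {k: display_value(v) for k, v in row.items()}
print(rendered['Qualified Node Recommendation'])  # -> N/A
```

Applying this only at print time keeps the underlying CSV/JSON values untouched while making onPrem rows read cleanly.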
I intentionally didn't modify the intermediate_output/cluster_shape_recommendation.json as it's in the intermediate output, and that can be a followup. In my opinion, if it's in this directory it shouldn't be used. I wanted to stop at some point and just get in what we have so multiple people can work on it. It doesn't change what it is for the worse.

Right now it prints a node recommendation, so the title matches. If we want to print more cluster information, that is an enhancement. This is only in the text stdout, so it doesn't seem like a big deal to me, but I'm open to opinions. Either way it can be a followup, since I see the cluster shape recommendation as a followup thing we can do, as long as it's coherent before release.

What Notes section? I am not familiar with it. If you are suggesting to put it there in addition, I guess it's fine, but how is that different from where it is now? I want it as close to that table as possible.
sure
I can look at the NoneType; honestly there were a lot before. The middle one is explicit and should be there.
Sure let me know.
ok, what is the question or bug here you are pointing out? I was trying to not really touch this file; I may have in modifying some variables.
…Category stdout since we don't want them in csv
> ok, what is the question or bug here you are pointing out? I was trying to not really touch this file; I may have in modifying some variables.

My question here is that the source cluster is a CPU cluster. I don't expect to see GPU information within the source cluster.
user_tools/src/spark_rapids_pytools/resources/templates/cluster_template/dataproc.ms (resolved)
LGTM.
Thanks @tgravescs
Fixes #1160 and #997.
This has a bunch of changes in it. Currently we have basic integration to be able to run the qualification auto tuner against CPU event logs, but it isn't tied to any node recommendations or other features in the python user tools, so users can get incorrect or inconsistent behavior; this fixes that.
Note, the assumption here is that you run a similar sized GPU cluster compared to the CPU cluster. Similar sized to me means the same number of executors with similar cores/memory (might not be exact). If you change the number of executors, then you change the parallelism possibilities.
Here are some of the changes:
There are a bunch of things that still need further enhancement and most of those have issues or are documented in #1152
Tested via existing unit tests, and then I manually tested qualification against databricks aws, emr, databricks azure, dataproc (all of the flavors), and on prem.
Example output running on Dataproc without passing the --cluster option so it does a per app level recommendation:
Same recommendation when you do pass in the --cluster option, which implies all the apps ran on this type of cluster, and it uses that cluster instead of the cluster info from the event log: