Generate summary metadata file and fix node recommendation in python #1216

parthosa · 2024-07-23T17:09:15Z

Fixes #1143, fixes #1215.

This PR introduces the following, along with some code refactoring:

- Generates a metadata JSON file, qualification_summary_metadata.json, containing per-app information such as the tuning files, recommended cluster, and speedup category.
- Fixes a bug in the Python tool where node recommendations were not generated for any app if some apps lacked node recommendations.

Code Changes

Introduced a class ClusterConfigRecommender that cleans up the code by tying together the following:
- Calculate the columns for cluster tuning config files and cluster recommendation
- Add these columns to the final df processed by tools
Introduced a method qualification.py::_write_summary_metadata() (issue-1143)
- Writes the specified columns to the metadata JSON file
- Prepends the full path to the cluster tuning config file names
- Converts column names to camel case
In method qualification.py::__infer_cluster_for_auto_tuning() (issue-1215)
- If we cannot create a CPU cluster for an app, we now continue inferring CPU clusters for the remaining apps instead of returning.
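
The issue-1215 change comes down to replacing an early return with a `continue`. Below is a minimal sketch of that control flow, not the PR's actual code: `infer_clusters` and the `infer_one` callable (which returns `None` when no CPU cluster can be created for an app) are hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict, Iterable, Optional

Cluster = Dict[str, object]


def infer_clusters(
    app_ids: Iterable[str],
    infer_one: Callable[[str], Optional[Cluster]],
) -> Dict[str, Cluster]:
    """Infer a CPU cluster per app, skipping apps where inference fails."""
    clusters: Dict[str, Cluster] = {}
    for app_id in app_ids:
        cluster = infer_one(app_id)
        if cluster is None:
            # An early `return` here would abandon every remaining app.
            # Skipping only this app keeps the others' recommendations.
            continue
        clusters[app_id] = cluster
    return clusters


# Hypothetical lookup: app-2 has no cluster information.
known = {
    "app-1": {"executorInstance": "r5d.2xlarge", "numExecutors": 2},
    "app-3": {"executorInstance": "r5d.2xlarge", "numExecutors": 2},
}
result = infer_clusters(["app-1", "app-2", "app-3"], known.get)
print(sorted(result))  # ['app-1', 'app-3']
```

With the early return, the failure on the second app would have stopped the loop before the third app was considered.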

Output

Metadata JSON

File: `qual_20240723160911_B043Bae0/qualification_summary_metadata.json`

[
  {
    "appId": "app-20240311074805-0000",
    "appName": "test_app_11111",
    "sourceCluster": {
      "driverInstance": "r5d.2xlarge",
      "executorInstance": "r5d.2xlarge",
      "numExecutors": 2
    },
    "recommendedCluster": {
      "driverInstance": "r5d.2xlarge",
      "executorInstance": "g5.2xlarge",
      "numExecutors": 2
    },
    "estimatedGpuSpeedupCategory": "Medium",
    "fullClusterConfigRecommendations": "/path/qual_20240723160911_B043Bae0/rapids_4_spark_qualification_output/tuning/app-20240311074805-0000.conf",
    "gpuConfigRecommendationBreakdown": "/path/qual_20240723160911_B043Bae0/rapids_4_spark_qualification_output/tuning/app-20240311074805-0000.log"
  },
  {
    "appId": "app-20240312011448-0000",
    "appName": "test_app_22222",
    "sourceCluster": {},
    "recommendedCluster": {},
    "estimatedGpuSpeedupCategory": "Not Recommended",
    "fullClusterConfigRecommendations": "/path/qual_20240723160911_B043Bae0/rapids_4_spark_qualification_output/tuning/app-20240312011448-0000.conf",
    "gpuConfigRecommendationBreakdown": "/path/qual_20240723160911_B043Bae0/rapids_4_spark_qualification_output/tuning/app-20240312011448-0000.log"
  }
]
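
A file of this shape could be produced roughly as follows. This is a sketch under stated assumptions rather than the PR's `_write_summary_metadata()` implementation: the helper names `to_camel_case`, `build_summary_metadata`, and `write_summary_metadata` are hypothetical, and rows are modeled as plain dicts instead of a DataFrame to keep the example free of third-party dependencies.

```python
import json
import os
import re
from typing import Any, Dict, Iterable, List, Set


def to_camel_case(name: str) -> str:
    """Convert a column title such as 'App ID' to camel case ('appId')."""
    words = re.split(r"[\s_]+", name.strip())
    return words[0].lower() + "".join(w.capitalize() for w in words[1:])


def build_summary_metadata(
    rows: Iterable[Dict[str, Any]],
    tuning_dir: str,
    path_columns: Set[str],
) -> List[Dict[str, Any]]:
    """Rename columns to camel case and expand tuning file names to full paths."""
    records = []
    for row in rows:
        record = {}
        for col, value in row.items():
            if col in path_columns:
                # Tuning file names are relative; prefix the tuning directory.
                value = os.path.join(tuning_dir, value)
            record[to_camel_case(col)] = value
        records.append(record)
    return records


def write_summary_metadata(records: List[Dict[str, Any]], out_path: str) -> None:
    """Write the records as an indented JSON array."""
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, indent=2)


# Hypothetical input row using the column titles from the console table.
rows = [{
    "App ID": "app-1",
    "Estimated GPU Speedup Category": "Medium",
    "Full Cluster Config Recommendations": "app-1.conf",
}]
records = build_summary_metadata(
    rows, "/path/tuning", {"Full Cluster Config Recommendations"}
)
print(json.dumps(records, indent=2))
```

Keeping the build step separate from the write step makes the renaming and path handling testable without touching the file system.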

Node Recommendation Fix

See below. Added as a separate comment for readability

To Discuss:

In cluster_inference.py::get_cluster_template_args()
- We were trying to read 'Recommended Executor Instance' while creating the CPU instance object.
- This may cause an error because these are GPU instances.
- Removed the block.
We should no longer generate the instance type conversion section, since the conversions are now per-app.
- Removed sections_generators=[self.__generate_mc_types_conversion_report],
- The function __generate_mc_types_conversion_report() can also be removed altogether.

Follow Up

In a follow-up PR (show status report in console), event logs will be added as a property in the metadata file.

Signed-off-by: Partho Sarthi <[email protected]>

amahussein

If we are generating a new file qualification_summary_metadata.json do we get rid of the previous file that was showing source/target clusters?
I am not 100% sure yet what that file name should be. There are too many qualification_summary* files at the root level.
In the PR description, I cannot see the same notes in "before" and "after".

user_tools/src/spark_rapids_tools/utils/util.py

parthosa · 2024-07-23T20:05:26Z

Output

Node Recommendation Fix

CMD:

spark_rapids qualification --platform databricks-aws  --eventlogs $EVENTLOG_PATH  --tools_jar $SPARK_RAPIDS_TOOLS_JAR

Before: Incorrect Node Recommendation

    9 directories, 86 files
    - To learn more about the output details, visit https://docs.nvidia.com/spark-rapids/user-guide/latest/qualification/quickstart.html#qualification-output
    - Summarized savings and speedups CSV report: /path/qual_20240723201337_7dCB4E86/qualification_summary.csv
    - Full savings and speedups CSV report: /path/qual_20240723201337_7dCB4E86/qualification_summary_full.csv
    - Intermediate output generated by tools: /path/qual_20240723201337_7dCB4E86/intermediate_output
+----+-------------------------+-------------------------+-----------------+------------------+------------------------------+-----------------------------+
|    | App Name                | App ID                  | Estimated GPU   | Qualified Node   | Full Cluster                 | GPU Config                  |
|    |                         |                         | Speedup         | Recommendation   | Config                       | Recommendation              |
|    |                         |                         | Category**      |                  | Recommendations*             | Breakdown*                  |
|----+-------------------------+-------------------------+-----------------+------------------+------------------------------+-----------------------------|
|  1 | test_app_11111          | app-20240311195738-0000 | Medium          | Not Available    | app-20240311195738-0000.conf | app-20240311195738-0000.log |
|  0 | test_app_22222          | app-20240311074805-0000 | Medium          | Not Available    | app-20240311074805-0000.conf | app-20240311074805-0000.log |
+----+-------------------------+-------------------------+-----------------+------------------+------------------------------+-----------------------------+
* Config Recommendations can be found in /path/qual_20240723201337_7dCB4E86/rapids_4_spark_qualification_output/tuning
** Estimated GPU Speedup Category assumes the user is using the node type recommended and config recommendations with the same size cluster as was used with the CPU side.

Report Summary:
------------------  -
Total applications  4
Top candidates      2
------------------  -

After: Correct Node Recommendation

    9 directories, 87 files
    - To learn more about the output details, visit https://docs.nvidia.com/spark-rapids/user-guide/latest/qualification/quickstart.html#qualification-output
    - Summarized savings and speedups CSV report: /path/qual_20240723160911_B043Bae0/qualification_summary.csv
    - Full savings and speedups CSV report: /path/qual_20240723160911_B043Bae0/qualification_summary_full.csv
    - Intermediate output generated by tools: /path/qual_20240723160911_B043Bae0/intermediate_output
    - Metadata for the summary report: /path/qual_20240723160911_B043Bae0/qualification_summary_metadata.json
    - Config Recommendations: /path/qual_20240723160911_B043Bae0/rapids_4_spark_qualification_output/tuning
+----+-------------------------+-------------------------+-----------------+---------------------------+------------------------------+-----------------------------+
|    | App Name                | App ID                  | Estimated GPU   | Qualified Node            | Full Cluster                 | GPU Config                  |
|    |                         |                         | Speedup         | Recommendation            | Config                       | Recommendation              |
|    |                         |                         | Category**      |                           | Recommendations*             | Breakdown*                  |
|----+-------------------------+-------------------------+-----------------+---------------------------+------------------------------+-----------------------------|
|  1 | test_app_11111          | app-20240311195738-0000 | Medium          | r5d.2xlarge to g5.2xlarge | app-20240311195738-0000.conf | app-20240311195738-0000.log |
|  0 | test_app_22222          | app-20240311074805-0000 | Medium          | r5d.2xlarge to g5.2xlarge | app-20240311074805-0000.conf | app-20240311074805-0000.log |
+----+-------------------------+-------------------------+-----------------+---------------------------+------------------------------+-----------------------------+

Report Summary:
------------------  -
Total applications  4
Top candidates      2
------------------  -

Notes:
--------------------
 - 'Estimated GPU Speedup Category' assumes the user is using the node type recommended and config recommendations with the same size cluster as was used with the CPU side.
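
For readers consuming qualification_summary_metadata.json directly, the "Qualified Node Recommendation" column can be approximated from a metadata entry. The sketch below assumes the column is derived from the `executorInstance` fields of `sourceCluster` and `recommendedCluster`. That matches the sample output in this PR but is not confirmed by the code shown here, and `format_node_recommendation` is a hypothetical helper.

```python
from typing import Any, Dict


def format_node_recommendation(entry: Dict[str, Any]) -> str:
    """Return 'X to Y' for an app, or 'Not Available' if either side is missing."""
    source = entry.get("sourceCluster", {}).get("executorInstance")
    target = entry.get("recommendedCluster", {}).get("executorInstance")
    if not source or not target:
        return "Not Available"
    return f"{source} to {target}"


with_clusters = {
    "sourceCluster": {"executorInstance": "r5d.2xlarge"},
    "recommendedCluster": {"executorInstance": "g5.2xlarge"},
}
without_clusters = {"sourceCluster": {}, "recommendedCluster": {}}

print(format_node_recommendation(with_clusters))     # r5d.2xlarge to g5.2xlarge
print(format_node_recommendation(without_clusters))  # Not Available
```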

parthosa · 2024-07-23T20:06:36Z

Thanks @amahussein.

> If we are generating a new file qualification_summary_metadata.json do we get rid of the previous file that was showing source/target clusters?

Yes, the previous file has been removed since it contained only cluster information.

> I am not 100% sure yet what that file name should be. There are too many qualification_summary* files at the root level.

As part of #1099, we should get rid of the summary CSV grouped by name. In that case we will have only two files at the root level: qualification_summary.csv and qualification_summary_metadata.json.

> In the PR description, I cannot see the same notes in "before" and "after".

There seems to be a rendering issue in GitHub markdown due to special characters. Added it as a separate comment above.

Signed-off-by: Partho Sarthi <[email protected]>

user_tools/src/spark_rapids_tools/tools/cluster_config_recommender.py

tgravescs · 2024-07-30T14:53:00Z

Filed #1239 to follow up on this.

Refactor cluster recommendation and utility methods

5f702fb

Signed-off-by: Partho Sarthi <[email protected]>

parthosa added the bug, feature request, and user_tools labels Jul 23, 2024

parthosa requested review from tgravescs, cindyyuanjiang, amahussein and nartal1 July 23, 2024 17:09

parthosa self-assigned this Jul 23, 2024

parthosa added the affect-output label Jul 23, 2024

parthosa marked this pull request as ready for review July 23, 2024 17:20

amahussein reviewed Jul 23, 2024

user_tools/src/spark_rapids_tools/utils/util.py (outdated)

parthosa added 3 commits July 23, 2024 14:53

Reuse existing methods and rename outfile file comment

5fe2b92

Signed-off-by: Partho Sarthi <[email protected]>

Add source cluster config in metadata file

de1ca47

Signed-off-by: Partho Sarthi <[email protected]>

Fix union typing

bbc4296

Signed-off-by: Partho Sarthi <[email protected]>

parthosa requested a review from amahussein July 23, 2024 23:32

nartal1 reviewed Jul 24, 2024

user_tools/src/spark_rapids_tools/tools/cluster_config_recommender.py

amahussein approved these changes Jul 25, 2024

amahussein merged commit c2af6e8 into NVIDIA:dev Jul 25, 2024
14 checks passed

tgravescs mentioned this pull request Jul 30, 2024

[BUG] Review and update recommended Cluster info in metadata json file #1239

Closed

tgravescs mentioned this pull request Jul 30, 2024

[BUG] Top level candidate table * and ** notes should be directly under the table. #1240

Closed

parthosa deleted the spark-rapids-tools-1143 branch October 9, 2024 17:11
