Generate summary metadata file and fix node recommendation in python #1216

Merged
merged 4 commits into from
Jul 25, 2024

Conversation

parthosa
Collaborator

@parthosa parthosa commented Jul 23, 2024

Fixes #1143 , Fixes #1215.

This PR introduces the following and some code refactoring:

  1. Generates a metadata JSON file, qualification_summary_metadata.json, containing per-app information such as the tuning files, the source and recommended clusters, and the estimated GPU speedup category.
  2. Fixes a bug in the Python tools where node recommendations were not generated for any app if some apps lacked node recommendations.

Code Changes

  1. Introduced a class ClusterConfigRecommender that cleans up the code by tying together the following:
    • Calculates the columns for the cluster tuning config files and the cluster recommendation
    • Adds these columns to the final DataFrame processed by the tools
  2. Introduced a method qualification.py::_write_summary_metadata() (issue-1143)
    • Writes the specified columns to the metadata JSON file
    • Expands the cluster tuning config file names to full paths
    • Converts column names to camel case
  3. In method qualification.py::__infer_cluster_for_auto_tuning() (issue-1215)
    • If we cannot create a CPU cluster for an app, we now continue inferring CPU clusters for the remaining apps instead of returning.
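
The behavioral change in item 3 can be sketched as follows. This is a minimal illustration, not the code from qualification.py: `infer_cpu_cluster`, `infer_clusters_for_apps`, and the sample app records are hypothetical stand-ins, and the only point is the difference between returning early and skipping a single app.

```python
from typing import Optional


def infer_cpu_cluster(app: dict) -> Optional[dict]:
    """Hypothetical stand-in for per-app CPU cluster inference.

    Returns None when the app lacks the information needed to build a cluster.
    """
    instance = app.get("executorInstance")
    if instance is None:
        return None
    return {"executorInstance": instance, "numExecutors": app.get("numExecutors", 1)}


def infer_clusters_for_apps(apps: list) -> dict:
    """Map app IDs to inferred clusters, skipping apps that cannot be inferred."""
    clusters = {}
    for app in apps:
        cluster = infer_cpu_cluster(app)
        if cluster is None:
            # A `return` here would drop the recommendations for every
            # remaining app. Skipping only this app keeps the loop going.
            continue
        clusters[app["appId"]] = cluster
    return clusters


# Hypothetical input: the first app has no instance information.
apps = [
    {"appId": "app-A"},
    {"appId": "app-B", "executorInstance": "r5d.2xlarge", "numExecutors": 2},
]
result = infer_clusters_for_apps(apps)
print(result)
```

With an early return, the failure on the first app would leave the second app without a cluster as well, which matches the "Not Available" symptom described in issue-1215.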

Output

Metadata JSON

File: `qual_20240723160911_B043Bae0/qualification_summary_metadata.json`
[
  {
    "appId": "app-20240311074805-0000",
    "appName": "test_app_11111",
    "sourceCluster": {
      "driverInstance": "r5d.2xlarge",
      "executorInstance": "r5d.2xlarge",
      "numExecutors": 2
    },
    "recommendedCluster": {
      "driverInstance": "r5d.2xlarge",
      "executorInstance": "g5.2xlarge",
      "numExecutors": 2
    },
    "estimatedGpuSpeedupCategory": "Medium",
    "fullClusterConfigRecommendations": "/path/qual_20240723160911_B043Bae0/rapids_4_spark_qualification_output/tuning/app-20240311074805-0000.conf",
    "gpuConfigRecommendationBreakdown": "/path/qual_20240723160911_B043Bae0/rapids_4_spark_qualification_output/tuning/app-20240311074805-0000.log"
  },
  {
    "appId": "app-20240312011448-0000",
    "appName": "test_app_22222",
    "sourceCluster": {},
    "recommendedCluster": {},
    "estimatedGpuSpeedupCategory": "Not Recommended",
    "fullClusterConfigRecommendations": "/path/qual_20240723160911_B043Bae0/rapids_4_spark_qualification_output/tuning/app-20240312011448-0000.conf",
    "gpuConfigRecommendationBreakdown": "/path/qual_20240723160911_B043Bae0/rapids_4_spark_qualification_output/tuning/app-20240312011448-0000.log"
  }
]
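
The two transformations behind this output (camel-casing column names and expanding tuning file names to full paths) can be sketched with the standard library only. This is an assumption-laden illustration, not the body of `_write_summary_metadata()`: the helper names `to_camel_case` and `build_summary_metadata`, the input row layout, and the choice of path columns are all hypothetical.

```python
import json
import os


def to_camel_case(column_name: str) -> str:
    """Convert a space-separated column name to camel case (hypothetical helper)."""
    words = column_name.split()
    return words[0].lower() + "".join(word.capitalize() for word in words[1:])


def build_summary_metadata(rows: list, tuning_dir: str, path_columns: set) -> list:
    """Build metadata entries: camel-case the keys and expand file names to full paths."""
    metadata = []
    for row in rows:
        entry = {}
        for column, value in row.items():
            if column in path_columns:
                # Assumed behavior: tuning file names are joined to the tuning directory.
                value = os.path.join(tuning_dir, value)
            entry[to_camel_case(column)] = value
        metadata.append(entry)
    return metadata


# Sample row modeled on the first entry in the output above.
rows = [
    {
        "App ID": "app-20240311074805-0000",
        "App Name": "test_app_11111",
        "Estimated GPU Speedup Category": "Medium",
        "Full Cluster Config Recommendations": "app-20240311074805-0000.conf",
        "GPU Config Recommendation Breakdown": "app-20240311074805-0000.log",
    }
]
tuning_dir = "/path/qual_20240723160911_B043Bae0/rapids_4_spark_qualification_output/tuning"
path_columns = {"Full Cluster Config Recommendations", "GPU Config Recommendation Breakdown"}

metadata = build_summary_metadata(rows, tuning_dir, path_columns)
print(json.dumps(metadata, indent=2))
```

The nested `sourceCluster` and `recommendedCluster` objects are left out of the sketch because the PR does not describe how they are assembled.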

Node Recommendation Fix

See the separate comment below, added for readability.

To Discuss:

  1. In cluster_inference.py::get_cluster_template_args()
    • The method tried to read 'Recommended Executor Instance' while creating the CPU instance object.
    • This may cause an error because these are GPU instances.
    • This block has been removed.
  2. We should not generate instance type conversion section anymore since the conversions are now per-app instead.
    • Removed sections_generators=[self.__generate_mc_types_conversion_report],
    • The function __generate_mc_types_conversion_report() could also be removed altogether.

Follow Up

  • In a follow-up PR (show status report in console), event logs will be added as a property in the metadata file.

@parthosa parthosa added bug Something isn't working feature request New feature or request user_tools Scope the wrapper module running CSP, QualX, and reports (python) labels Jul 23, 2024
@parthosa parthosa self-assigned this Jul 23, 2024
@parthosa parthosa added the affect-output A change that modifies the output (add/remove/rename files, add/remove/rename columns) label Jul 23, 2024
@parthosa parthosa marked this pull request as ready for review July 23, 2024 17:20
Collaborator

@amahussein amahussein left a comment

  • If we are generating a new file qualification_summary_metadata.json do we get rid of the previous file that was showing source/target clusters?
  • I am not 100% sure yet what that file name should be. There are too many qualification_summary* files at the root level.
  • In the PR description, I cannot see the same notes in "before" and "after".

Review thread on user_tools/src/spark_rapids_tools/utils/util.py (outdated, resolved)
@parthosa
Collaborator Author

parthosa commented Jul 23, 2024

Output

Node Recommendation Fix

CMD:

spark_rapids qualification --platform databricks-aws  --eventlogs $EVENTLOG_PATH  --tools_jar $SPARK_RAPIDS_TOOLS_JAR

Before: Incorrect Node Recommendation

    9 directories, 86 files
    - To learn more about the output details, visit https://docs.nvidia.com/spark-rapids/user-guide/latest/qualification/quickstart.html#qualification-output
    - Summarized savings and speedups CSV report: /path/qual_20240723201337_7dCB4E86/qualification_summary.csv
    - Full savings and speedups CSV report: /path/qual_20240723201337_7dCB4E86/qualification_summary_full.csv
    - Intermediate output generated by tools: /path/qual_20240723201337_7dCB4E86/intermediate_output
+----+-------------------------+-------------------------+-----------------+------------------+------------------------------+-----------------------------+
|    | App Name                | App ID                  | Estimated GPU   | Qualified Node   | Full Cluster                 | GPU Config                  |
|    |                         |                         | Speedup         | Recommendation   | Config                       | Recommendation              |
|    |                         |                         | Category**      |                  | Recommendations*             | Breakdown*                  |
|----+-------------------------+-------------------------+-----------------+------------------+------------------------------+-----------------------------|
|  1 | test_app_11111          | app-20240311195738-0000 | Medium          | Not Available    | app-20240311195738-0000.conf | app-20240311195738-0000.log |
|  0 | test_app_22222          | app-20240311074805-0000 | Medium          | Not Available    | app-20240311074805-0000.conf | app-20240311074805-0000.log |
+----+-------------------------+-------------------------+-----------------+------------------+------------------------------+-----------------------------+
* Config Recommendations can be found in /path/qual_20240723201337_7dCB4E86/rapids_4_spark_qualification_output/tuning
** Estimated GPU Speedup Category assumes the user is using the node type recommended and config recommendations with the same size cluster as was used with the CPU side.

Report Summary:
------------------  -
Total applications  4
Top candidates      2
------------------  -

After: Correct Node Recommendation

    9 directories, 87 files
    - To learn more about the output details, visit https://docs.nvidia.com/spark-rapids/user-guide/latest/qualification/quickstart.html#qualification-output
    - Summarized savings and speedups CSV report: /path/qual_20240723160911_B043Bae0/qualification_summary.csv
    - Full savings and speedups CSV report: /path/qual_20240723160911_B043Bae0/qualification_summary_full.csv
    - Intermediate output generated by tools: /path/qual_20240723160911_B043Bae0/intermediate_output
    - Metadata for the summary report: /path/qual_20240723160911_B043Bae0/qualification_summary_metadata.json
    - Config Recommendations: /path/qual_20240723160911_B043Bae0/rapids_4_spark_qualification_output/tuning
+----+-------------------------+-------------------------+-----------------+---------------------------+------------------------------+-----------------------------+
|    | App Name                | App ID                  | Estimated GPU   | Qualified Node            | Full Cluster                 | GPU Config                  |
|    |                         |                         | Speedup         | Recommendation            | Config                       | Recommendation              |
|    |                         |                         | Category**      |                           | Recommendations*             | Breakdown*                  |
|----+-------------------------+-------------------------+-----------------+---------------------------+------------------------------+-----------------------------|
|  1 | test_app_11111          | app-20240311195738-0000 | Medium          | r5d.2xlarge to g5.2xlarge | app-20240311195738-0000.conf | app-20240311195738-0000.log |
|  0 | test_app_22222          | app-20240311074805-0000 | Medium          | r5d.2xlarge to g5.2xlarge | app-20240311074805-0000.conf | app-20240311074805-0000.log |
+----+-------------------------+-------------------------+-----------------+---------------------------+------------------------------+-----------------------------+

Report Summary:
------------------  -
Total applications  4
Top candidates      2
------------------  -

Notes:
--------------------
 - 'Estimated GPU Speedup Category' assumes the user is using the node type recommended and config recommendations with the same size cluster as was used with the CPU side.
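
For readers consuming the metadata file, the 'Qualified Node Recommendation' column shown above can be approximated from a metadata entry. The sketch below is an assumption about how the two relate, not the tool's implementation: `qualified_node_recommendation` is a hypothetical helper, and the "Not Available" fallback for empty clusters is inferred from the "before" table.

```python
def qualified_node_recommendation(entry: dict) -> str:
    """Format a node recommendation string from one metadata entry (hypothetical helper)."""
    source = entry.get("sourceCluster", {}).get("executorInstance")
    target = entry.get("recommendedCluster", {}).get("executorInstance")
    if not source or not target:
        # Assumed fallback when either cluster is empty or missing.
        return "Not Available"
    return f"{source} to {target}"


# Entries trimmed from the metadata JSON in the PR description.
entries = [
    {
        "appId": "app-20240311074805-0000",
        "sourceCluster": {"driverInstance": "r5d.2xlarge", "executorInstance": "r5d.2xlarge", "numExecutors": 2},
        "recommendedCluster": {"driverInstance": "r5d.2xlarge", "executorInstance": "g5.2xlarge", "numExecutors": 2},
    },
    {"appId": "app-20240312011448-0000", "sourceCluster": {}, "recommendedCluster": {}},
]

recommendations = {e["appId"]: qualified_node_recommendation(e) for e in entries}
for app_id, text in recommendations.items():
    print(app_id, "->", text)
```

In practice the entries would come from `json.load()` on qualification_summary_metadata.json rather than from an inline list.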

@parthosa
Collaborator Author

parthosa commented Jul 23, 2024

Thanks @amahussein.

If we are generating a new file qualification_summary_metadata.json do we get rid of the previous file that was showing source/target clusters?

Yes, the previous file has been removed since it contained only cluster information.

I am not 100% sure yet what that file name should be. There are too many qualification_summary* files at the root level.

As part of #1099, we should get rid of the summary CSV grouped by name. In that case, we will have only two files at the root level: qualification_summary.csv and qualification_summary_metadata.json.

In the PR description, I cannot see the same notes in "before" and "after".

There seems to be a rendering issue in GitHub markdown due to special characters. I added the output as a separate comment above.

@parthosa parthosa requested a review from amahussein July 23, 2024 23:32
@amahussein amahussein merged commit c2af6e8 into NVIDIA:dev Jul 25, 2024
14 checks passed
@tgravescs
Collaborator

Filed #1239 to follow up on this.

Labels
affect-output A change that modifies the output (add/remove/rename files, add/remove/rename columns) bug Something isn't working feature request New feature or request user_tools Scope the wrapper module running CSP, QualX, and reports (python)
Projects
None yet
4 participants