[FEA] Add Estimation Model to Qualification CLI #870

Merged
merged 3 commits into from
Mar 25, 2024

@amahussein amahussein commented Mar 22, 2024

Signed-off-by: Ahmed Hussein (amahussein) [email protected]

Fixes #869

  • Add estimation_model to qualification arguments
  • Refactor the job submission to run as concurrent processes, because the xgboost model needs both the qualification and profiling tools to run
  • Remove --per-sql from the allowed list of qualification tool because it is used in the XGBOOST model
  • Import qualx code into user_tools repo
  • Running the qual CLI with --estimation_model XGBOOST runs the prediction model and generates the results as intermediate output, but it does not affect the final results
  • Two files are generated inside the final output directory: qual_*_app.csv and qual_*_sql
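
As an illustration of the concurrent job submission described above, here is a minimal sketch, not the code in this PR. It assumes each tool can be launched as an independent child process; the `JOBS` commands are placeholders (a Python one-liner stands in for each real tool invocation), and `run_job` and `run_jobs_concurrently` are hypothetical names.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Placeholder commands (assumption): each one stands in for a real tool launch.
JOBS = {
    "qualification": [sys.executable, "-c", "print('qualification done')"],
    "profiling": [sys.executable, "-c", "print('profiling done')"],
}


def run_job(name, cmd):
    """Run one command as a child process; return (name, exit code, stdout)."""
    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return name, proc.returncode, proc.stdout.strip()


def run_jobs_concurrently(jobs):
    """Start every job in its own process and collect the results by job name."""
    # Threads only wait on the child processes, so the jobs overlap in time.
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = [pool.submit(run_job, name, cmd) for name, cmd in jobs.items()]
        finished = [future.result() for future in futures]
    return {name: (code, out) for name, code, out in finished}


if __name__ == "__main__":
    for name, (code, out) in run_jobs_concurrently(JOBS).items():
        print(f"{name}: exit={code} output={out!r}")
```

Collecting the exit code per job, rather than raising on the first failure, lets the caller decide whether a failed profiling run should block the qualification results.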

Notes:

  • The estimation model uses on-prem for now
  • The remaining work will be WIP through tasks listed in issue-806
    • we want to extract the readDataFormat from the profiler output of each application
    • This probably will be a new class to hold appMetadata
    • Modify the modeling code to process one app at a time. Depending on the metadata, the prediction model will be loaded.
    • The prediction model uses information from both the Profiler and Qual CSV files, so we will need to handle errors raised for applications that do not appear in both tools' outputs.
    • The new Speedups are generated. Then we override the original Speedup estimation with the Qx Prediction, “estimated_df”
    • The “estimated_df” should be similar to the legacy qual DF.
    • The report generation (stdout+csv file) code won’t need to be changed.
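
The override step in the notes above could look roughly like the following sketch. It is an assumption-laden illustration, not the planned implementation: plain dictionaries stand in for the DataFrames, the `app_id` and `speedup` keys are hypothetical column names, and keeping the legacy estimate for unmatched applications is one possible policy that the notes do not specify.

```python
def override_speedups(qual_rows, prof_app_ids, predictions):
    """Build the estimated rows and list the apps missing from either tool.

    qual_rows:     list of dicts from the Qual output (hypothetical keys).
    prof_app_ids:  set of app IDs found in the Profiler output.
    predictions:   dict mapping app ID to the predicted speedup.
    """
    qual_ids = {row["app_id"] for row in qual_rows}
    matched = qual_ids & set(prof_app_ids)

    estimated = []
    for row in qual_rows:
        new_row = dict(row)  # copy so the legacy rows stay untouched
        app_id = row["app_id"]
        if app_id in matched and app_id in predictions:
            # Override the legacy estimate with the model prediction.
            new_row["speedup"] = predictions[app_id]
        # Assumed policy: unmatched apps keep their legacy estimate.
        estimated.append(new_row)

    # Apps present in only one of the two outputs, reported for error handling.
    unmatched = sorted(qual_ids ^ set(prof_app_ids))
    return estimated, unmatched


if __name__ == "__main__":
    qual = [{"app_id": "app-1", "speedup": 1.5}, {"app_id": "app-2", "speedup": 2.0}]
    rows, missing = override_speedups(qual, {"app-1", "app-3"}, {"app-1": 3.2})
    print(rows)
    print(missing)
```

Because the result keeps the same keys as the input rows, downstream report code that reads the legacy structure would not need to change, which matches the intent stated in the notes.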

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Fixes NVIDIA#869

- Add `estimation_model` to qualification arguments
- Refactor the job submission to run as concurrent processes as we need
  to run both qualification and profiling tool in the xgboost model
- Remove `--per-sql` from the allowed list of qualification tool because
  it is used in the XGBOOST model
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
@amahussein amahussein added feature request New feature or request core_tools Scope the core module (scala) labels Mar 22, 2024
@amahussein amahussein self-assigned this Mar 22, 2024
@amahussein amahussein changed the title [FEA] Add estimationModel to Qualification CLI [FEA] Add Estimation Model to Qualification CLI Mar 22, 2024
@@ -92,6 +93,10 @@ def qualification(cpu_cluster: str = None,
"MATCH": keep GPU cluster same number of nodes as CPU cluster;
"CLUSTER": recommend optimal GPU cluster by cost for entire cluster;
"JOB": recommend optimal GPU cluster by cost per job.
:param estimation_model: Model used to calculate the estimated GPU duration and cost savings.
It accepts one of the following:
"XGBOOST": an XGBoost model for GPU duration estimation
minor preference to use lowercase for argument values: XGBOOST --> xgboost

@amahussein amahussein Mar 25, 2024

Thanks @mattahrens.

Done!
The CLI handles both lower and upper case. I changed the comments to lower case, which is reflected in the output of the --help command.
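
For readers wondering how a case-insensitive argument value can be handled, one common approach is an `Enum` with a `_missing_` hook. The sketch below is an assumption about how such handling might look, not the code in this PR; `EstimationModel` is a hypothetical name, and only the xgboost value comes from the discussion.

```python
from enum import Enum


class EstimationModel(Enum):
    """Hypothetical enum of accepted estimation_model values."""
    XGBOOST = "xgboost"

    @classmethod
    def _missing_(cls, value):
        # Called only when the exact value lookup fails; retry in lower case.
        if isinstance(value, str):
            lowered = value.lower()
            for member in cls:
                if member.value == lowered:
                    return member
        return None  # Enum then raises ValueError for unknown values


if __name__ == "__main__":
    print(EstimationModel("XGBOOST"))
    print(EstimationModel("xgboost"))
```

Storing the canonical value in lower case keeps help text and stored values consistent while still accepting any casing from the user.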

@parthosa parthosa left a comment

Thanks @amahussein.

@amahussein amahussein merged commit e005165 into NVIDIA:dev Mar 25, 2024
13 checks passed
@amahussein amahussein deleted the spark-rapids-tools-869 branch March 25, 2024 21:49
Successfully merging this pull request may close these issues.

[FEA] Port prediction code into user_tools and support estimation argument
3 participants