[FEA] Qualification tool can infer the CPU jobs' cluster shape and then provide the suggestion based on that #581

Closed
3 tasks done
viadea opened this issue Sep 21, 2023 · 11 comments · Fixed by #789 or #803
Labels: core_tools Scope the core module (scala) · feature request New feature or request


viadea (Collaborator) commented Sep 21, 2023

I wish the Qualification tool could detect the CPU jobs' cluster shape and then provide suggestions based on that.

Currently, the qualification tool as designed uses a single cluster shape as input for the set of logs it is analyzing. The user would have to run the qual tool separately on the batch of logs for each unique cluster shape.

A common scenario is:
The user who runs the Qualification tool may not be the jobs' owner; as a result, it is difficult for them to first split the jobs into batches by cluster shape.
They just want to run the Qualification tool on all of the jobs at once.

If the Qualification tool can detect the worker node information from each individual event log, then we do not need the cluster shape information as input.
For example, Databricks event logs at least include the worker type information.

Tasks

  1. core_tools feature request (parthosa)
  2. user_tools feature request (parthosa)
  3. user_tools feature request (amahussein)
mattahrens (Collaborator) commented

@viadea is the main problem we want to solve with this issue that a customer isn't able to provide a CPU cluster shape for cost estimation purposes? Or is it something else?

viadea (Collaborator, Author) commented Oct 26, 2023

The main problem is that the event logs from a customer are not based on a single cluster shape, so they need to remove the --cpu-cluster option or use the jar version directly.

mattahrens (Collaborator) commented

To be clear, the CPU cluster shape is not used in the speedup estimation, only in the cost estimation. So we could try to infer the instance type from the executor information in the event log, but that would only impact the cost estimation for the projected CPU cluster shape (and the subsequent GPU cluster shape).

mattahrens (Collaborator) commented Nov 2, 2023

Draft of scope and requirements:

  1. If the customer does not supply a cluster shape, we can infer the cluster shape based on the executor instances and cores.
  2. For a given environment (Dataproc), we will have a default instance type that has a specified number of cores. For example, on Dataproc, we can use e2-standard-32 which has 32 cores.
  3. We will calculate the total cluster cores from the event log by multiplying the executor instances by the executor cores as visible in the application event log.
  4. The cluster shape in terms of number of workers will be calculated by dividing the total cluster cores by the default cores (for Dataproc, that would be 32 cores).
  5. The cluster shape then would use the default instance type and number of workers to estimate the CPU cluster shape and subsequent cost.

For other platforms such as Databricks, the instance type may be represented in the event log and could be used in place of a default for (2).

This path of execution can be off by default but triggered by a flag such as infer_cluster.
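
As a worked example of steps 3 and 4, here is a minimal Python sketch of the inference arithmetic (the function name and the round-up policy are assumptions for illustration; the draft above does not say how a fractional worker count is handled):

  DEFAULT_CORES_PER_NODE = 32  # e.g. e2-standard-32 on Dataproc

  def infer_num_workers(executor_instances: int, executor_cores: int,
                        cores_per_node: int = DEFAULT_CORES_PER_NODE) -> int:
      # Step 3: total cluster cores from the event log.
      total_cluster_cores = executor_instances * executor_cores
      # Step 4: workers = total cores / default cores per node, rounded up
      # (an assumption) so that all executor cores fit on the cluster.
      return max(1, -(-total_cluster_cores // cores_per_node))

  # 8 executor instances x 16 cores = 128 total cores -> 4 workers
  print(infer_num_workers(executor_instances=8, executor_cores=16))  # 4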

@mattahrens mattahrens added the core_tools Scope the core module (scala) label Nov 13, 2023
@parthosa parthosa self-assigned this Nov 22, 2023
@mattahrens mattahrens changed the title [FEA] Qualification tool can detect the CPU jobs' cluster shape and then provide the suggestion based on that [FEA] Qualification tool can infer the CPU jobs' cluster shape and then provide the suggestion based on that Nov 29, 2023
parthosa (Collaborator) commented Dec 13, 2023

Divided into two parts:

  1. The first component introduces the inference logic in the qualification tools for a single event log.
  2. From an offline discussion with @viadea: we can process multiple event logs and generate a table-based output for cluster migration and cost savings.

parthosa (Collaborator) commented Feb 12, 2024

Here is a design overview for this feature:

Design:

  1. Core Tools:
    1. In EventProcessor: collect the number of executor nodes, the number of cores, the executor instance type (only available in Databricks), and the driver instance type (only available in Databricks).
    2. Write these out as a cluster information CSV file.
  2. User Tools:
    1. Read the cluster information CSV file.
    2. If cluster information is available, construct a CPU cluster object and set the savings flag and the CPU cluster context.
    3. The rest of the flow should be the same.

Implementation Details:

Construction of Cpu Cluster object in User Tools:

  1. In each platform's config, a default JSON template for cluster information (the output of the describe command) will be maintained. We will fetch this default JSON template and update the executor/driver fields using the above cluster information.

For example,

  "defaultClusterConfig": {
       "cluster_id": "1234-5678-test",
       "cluster_name": "default-cluster-prop",
       "driver_node_type_id": "m6gd.xlarge",
       "node_type_id": "m6gd.2xlarge",
       "num_workers": 1,
       "state": "TERMINATED"
     },
  2. In each platform's config, a mapping of core count to instance type will be maintained. We will select the appropriate executor instance based on the number of cores.

For example,

"defaultCpuInstances": {
       "driver": "m6gd.xlarge",
       "executor": [
         {"name": "m6gd.large", "vCPUs": 2},
         {"name": "m6gd.xlarge", "vCPUs": 4},
         {"name": "m6gd.2xlarge", "vCPUs": 8},
         {"name": "m6gd.4xlarge", "vCPUs": 16},
         {"name": "m6gd.8xlarge", "vCPUs": 32},
         {"name": "m6gd.12xlarge", "vCPUs": 48},
         {"name": "m6gd.16xlarge", "vCPUs": 64}
       ]
     }
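
Putting the two config pieces together, here is a minimal Python sketch of the lookup and template update (the function names and the smallest-instance-that-fits policy are assumptions for illustration, not the tools' actual implementation):

  import json

  # Trimmed copies of the config examples above.
  DEFAULT_CLUSTER_CONFIG = {
      "cluster_id": "1234-5678-test",
      "driver_node_type_id": "m6gd.xlarge",
      "node_type_id": "m6gd.2xlarge",
      "num_workers": 1,
  }
  EXECUTOR_INSTANCES = [
      {"name": "m6gd.xlarge", "vCPUs": 4},
      {"name": "m6gd.2xlarge", "vCPUs": 8},
      {"name": "m6gd.4xlarge", "vCPUs": 16},
  ]

  def pick_executor_instance(cores_per_executor: int) -> str:
      # Pick the smallest instance whose vCPU count covers the executor
      # cores (the selection policy itself is an assumption here).
      for inst in sorted(EXECUTOR_INSTANCES, key=lambda i: i["vCPUs"]):
          if inst["vCPUs"] >= cores_per_executor:
              return inst["name"]
      return EXECUTOR_INSTANCES[-1]["name"]

  # Update the default template with the inferred executor fields.
  inferred = dict(DEFAULT_CLUSTER_CONFIG)
  inferred["node_type_id"] = pick_executor_instance(cores_per_executor=8)
  inferred["num_workers"] = 4  # e.g. the executor count from the event log
  print(json.dumps(inferred, indent=2))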

Method

I plan to divide this into two tasks:

  1. Generate the cluster information in Core Tools.
  2. Construct the CPU cluster from the generated file in User Tools.

mattahrens (Collaborator) commented

This looks great. One consideration -- how could we also support different instance type families for a given CSP? Is it possible to see the executor memory to find out if the instance is high-mem or standard, or even high-disk or standard?

mattahrens (Collaborator) commented

Also -- can we use a JSON or YAML format instead of CSV to pass data between core tools and user tools for this? It seems like that will be easier to maintain.

parthosa (Collaborator) commented Feb 13, 2024

One consideration -- how could we also support different instance type families for a given CSP?

We can use (numCores, memory) as keys to look up instance types across multiple series, as in the sketch below.
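
For illustration, a hedged sketch of that lookup; the instance table below is illustrative, not the tools' actual mapping:

  # (numCores, memoryGB) -> instance type, spanning two series so that
  # a high-mem executor maps to the r-family instead of the m-family.
  INSTANCE_BY_CORES_AND_MEM = {
      (8, 32): "m6gd.2xlarge",    # general purpose: ~4 GB per vCPU
      (8, 64): "r6gd.2xlarge",    # high-mem: ~8 GB per vCPU
      (16, 64): "m6gd.4xlarge",
      (16, 128): "r6gd.4xlarge",
  }

  def lookup_instance(num_cores: int, memory_gb: int):
      return INSTANCE_BY_CORES_AND_MEM.get((num_cores, memory_gb))

  print(lookup_instance(8, 64))  # r6gd.2xlarge (high-mem family)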

Can we use a JSON or YAML format instead of CSV to pass between core tools

I selected CSV for two reasons: (1) customers can view the cluster inference file and verify it, and (2) the existing outputs were CSV-based. However, based on a discussion with @amahussein, we decided to store the cluster information in JSON format, as it will be simpler to parse JSON in user_tools.

Sample JSON output:

{
  "app-001": {
    "appID": "app-001",
    "appName": "abc",
    "eventlog": "path",
    "cluster": {
      // cluster properties here
    }
  },
  "app-002": {
    "appID": "app-002",
    "appName": "abc",
    "eventlog": "path",
    "cluster": {
      // cluster properties here
    }
  }
}
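
A minimal sketch of how user_tools might consume this file (the file name is a placeholder, and the cluster properties stay elided as above):

  import json

  with open("cluster_information.json") as f:  # placeholder path
      apps = json.load(f)

  for app_id, info in apps.items():
      cluster = info.get("cluster", {})
      print(f"{app_id}: appName={info['appName']}, eventlog={info['eventlog']}")
      # ...construct the CPU cluster object from `cluster` here...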

tgravescs (Collaborator) commented

Note: some new cluster node recommendations are being added in #1160, so this should wait for that and use those node recommendations.

amahussein (Collaborator) commented

This is completed, as #1160 is merged and cost savings are turned off.
