[FEA] Qualification tool can infer the CPU jobs' cluster shape and then provide the suggestion based on that #581

Closed
3 tasks done
viadea opened this issue Sep 21, 2023 · 11 comments · Fixed by #789 or #803
Labels: core_tools Scope the core module (scala) · feature request New feature or request


viadea (Collaborator) commented Sep 21, 2023

I wish the Qualification tool could detect the CPU jobs' cluster shape and then provide suggestions based on that.

Currently, the qualification tool as designed uses a single cluster shape as input for the set of logs it is analyzing. The user would have to run the qual tool separately on the batch of logs for each unique cluster shape.

A common scenario is:
The user who runs the Qualification tool may not be the jobs' owner; as a result, it is difficult for them to first split the jobs into batches by cluster shape.
They just want to run the Qualification tool on all of the jobs at once.

If the Qualification tool can detect the worker node information from each individual event log, then we do not need the cluster shape information as input.
For example, Databricks event logs at least include the worker type information.

Tasks

  1. core_tools feature request (parthosa)
  2. user_tools feature request (parthosa)
  3. user_tools feature request (amahussein)
mattahrens (Collaborator) commented

@viadea is the main problem we want to solve with this issue that a customer isn't able to provide a CPU cluster shape for cost estimation purposes? Or is it something else?

viadea (Collaborator, Author) commented Oct 26, 2023

The main problem is that the event logs from a customer are not based on a single cluster shape, so they need to remove the --cpu-cluster option or use the jar version directly.

mattahrens (Collaborator) commented

To be clear, the CPU cluster shape is not used in the speedup estimation, only in the cost estimation. So we could try to infer the instance type from the executor information in the event log, but that would only impact the cost estimation for the projected CPU cluster shape (and the subsequent GPU cluster shape).

mattahrens (Collaborator) commented Nov 2, 2023

Draft of scope and requirements:

  1. If the customer does not supply a cluster shape, we can infer the cluster shape based on the executor instances and cores.
  2. For a given environment (Dataproc), we will have a default instance type that has a specified number of cores. For example, on Dataproc, we can use e2-standard-32 which has 32 cores.
  3. We will calculate the total cluster cores from the event log by multiplying the executor instances by the executor cores as visible in the application event log.
  4. The cluster shape in terms of number of workers will be calculated by dividing the total cluster cores by the default cores (for Dataproc, that would be 32 cores).
  5. The cluster shape then would use the default instance type and number of workers to estimate the CPU cluster shape and subsequent cost.

For other platforms such as Databricks, the instance type may be represented in the event log and could be used in place of a default for (2).

This path of execution can be off by default but triggered by a flag such as infer_cluster.
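
As a worked example of steps 3 and 4, here is a minimal Python sketch of the inference arithmetic (the function name and the round-up policy are assumptions for illustration; the draft above does not say how a fractional worker count is handled):

  DEFAULT_CORES_PER_NODE = 32  # e.g. e2-standard-32 on Dataproc

  def infer_num_workers(executor_instances: int, executor_cores: int,
                        cores_per_node: int = DEFAULT_CORES_PER_NODE) -> int:
      # Step 3: total cluster cores from the event log.
      total_cluster_cores = executor_instances * executor_cores
      # Step 4: workers = total cores / default cores per node, rounded up
      # (an assumption) so that all executor cores fit on the cluster.
      return max(1, -(-total_cluster_cores // cores_per_node))

  # 8 executor instances x 16 cores = 128 total cores -> 4 workers
  print(infer_num_workers(executor_instances=8, executor_cores=16))  # 4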

@mattahrens mattahrens added the core_tools Scope the core module (scala) label Nov 13, 2023
@parthosa parthosa self-assigned this Nov 22, 2023
@mattahrens mattahrens changed the title [FEA] Qualification tool can detect the CPU jobs' cluster shape and then provide the suggestion based on that [FEA] Qualification tool can infer the CPU jobs' cluster shape and then provide the suggestion based on that Nov 29, 2023
parthosa (Collaborator) commented Dec 13, 2023

Divided into two parts:

  1. The first component introduces the inference logic in the qualification tools for a single event log.
  2. From an offline discussion with @viadea: we can process multiple event logs and generate a table-based output for cluster migration and cost savings.

parthosa (Collaborator) commented Feb 12, 2024

Here is a design overview for this feature:

Design:

  1. Core Tools:
    1. In EventProcessor: collect the number of executor nodes, the number of cores, the executor instance type (only available in Databricks), and the driver instance type (only available in Databricks).
    2. Write these out as a cluster information CSV file.
  2. User Tools:
    1. Read the cluster information CSV file.
    2. If cluster information is available, construct a CPU cluster object and set the savings flag and the CPU cluster context.
    3. The rest of the flow should be the same.

Implementation Details:

Construction of Cpu Cluster object in User Tools:

  1. In each platform's config, a default JSON template for cluster information (the output of the describe command) will be maintained. We will fetch this default JSON template and update the executor/driver fields using the above cluster information.

For example,

  "defaultClusterConfig": {
       "cluster_id": "1234-5678-test",
       "cluster_name": "default-cluster-prop",
       "driver_node_type_id": "m6gd.xlarge",
       "node_type_id": "m6gd.2xlarge",
       "num_workers": 1,
       "state": "TERMINATED"
     },
  2. In each platform's config, a mapping of core count to instance type will be maintained. We will select the appropriate executor instance based on the number of cores.

For example,

"defaultCpuInstances": {
       "driver": "m6gd.xlarge",
       "executor": [
         {"name": "m6gd.large", "vCPUs": 2},
         {"name": "m6gd.xlarge", "vCPUs": 4},
         {"name": "m6gd.2xlarge", "vCPUs": 8},
         {"name": "m6gd.4xlarge", "vCPUs": 16},
         {"name": "m6gd.8xlarge", "vCPUs": 32},
         {"name": "m6gd.12xlarge", "vCPUs": 48},
         {"name": "m6gd.16xlarge", "vCPUs": 64}
       ]
     }
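
Putting the two config pieces together, here is a minimal Python sketch of the lookup and template update (the function names and the smallest-instance-that-fits policy are assumptions for illustration, not the tools' actual implementation):

  import json

  # Trimmed copies of the config examples above.
  DEFAULT_CLUSTER_CONFIG = {
      "cluster_id": "1234-5678-test",
      "driver_node_type_id": "m6gd.xlarge",
      "node_type_id": "m6gd.2xlarge",
      "num_workers": 1,
  }
  EXECUTOR_INSTANCES = [
      {"name": "m6gd.xlarge", "vCPUs": 4},
      {"name": "m6gd.2xlarge", "vCPUs": 8},
      {"name": "m6gd.4xlarge", "vCPUs": 16},
  ]

  def pick_executor_instance(cores_per_executor: int) -> str:
      # Pick the smallest instance whose vCPU count covers the executor
      # cores (the selection policy itself is an assumption here).
      for inst in sorted(EXECUTOR_INSTANCES, key=lambda i: i["vCPUs"]):
          if inst["vCPUs"] >= cores_per_executor:
              return inst["name"]
      return EXECUTOR_INSTANCES[-1]["name"]

  # Update the default template with the inferred executor fields.
  inferred = dict(DEFAULT_CLUSTER_CONFIG)
  inferred["node_type_id"] = pick_executor_instance(cores_per_executor=8)
  inferred["num_workers"] = 4  # e.g. the executor count from the event log
  print(json.dumps(inferred, indent=2))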

Method

I plan to divide this into two tasks:

  1. Generate the cluster information in Core Tools.
  2. Construct the CPU cluster from the generated file in User Tools.

mattahrens (Collaborator) commented

This looks great. One consideration -- how could we also support different instance type families for a given CSP? Is it possible to see the executor memory to find out if the instance is high-mem or standard, or even high-disk or standard?

mattahrens (Collaborator) commented

Also -- can we use a JSON or YAML format instead of CSV to pass data between core tools and user tools for this? It seems like that will be easier to maintain.

parthosa (Collaborator) commented Feb 13, 2024

One consideration -- how could we also support different instance type families for a given CSP?

We can use (numCores, memory) as keys to look up instance types across multiple series, as in the sketch below.
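
For illustration, a hedged sketch of that lookup; the instance table below is illustrative, not the tools' actual mapping:

  # (numCores, memoryGB) -> instance type, spanning two series so that
  # a high-mem executor maps to the r-family instead of the m-family.
  INSTANCE_BY_CORES_AND_MEM = {
      (8, 32): "m6gd.2xlarge",    # general purpose: ~4 GB per vCPU
      (8, 64): "r6gd.2xlarge",    # high-mem: ~8 GB per vCPU
      (16, 64): "m6gd.4xlarge",
      (16, 128): "r6gd.4xlarge",
  }

  def lookup_instance(num_cores: int, memory_gb: int):
      return INSTANCE_BY_CORES_AND_MEM.get((num_cores, memory_gb))

  print(lookup_instance(8, 64))  # r6gd.2xlarge (high-mem family)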

Can we use a JSON or YAML format instead of CSV to pass between core tools

I selected CSV for two reasons: (1) customers can view the cluster inference file and verify it, and (2) the existing outputs were CSV-based. However, based on a discussion with @amahussein, we decided to store the cluster information in JSON format, as it will be simpler to parse JSON in user_tools.

Sample JSON output:

{
  "app-001": {
    "appID": "app-001",
    "appName": "abc",
    "eventlog": "path",
    "cluster": {
      // cluster properties here
    }
  },
  "app-002": {
    "appID": "app-002",
    "appName": "abc",
    "eventlog": "path",
    "cluster": {
      // cluster properties here
    }
  }
}
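
A minimal sketch of how user_tools might consume this file (the file name is a placeholder, and the cluster properties stay elided as above):

  import json

  with open("cluster_information.json") as f:  # placeholder path
      apps = json.load(f)

  for app_id, info in apps.items():
      cluster = info.get("cluster", {})
      print(f"{app_id}: appName={info['appName']}, eventlog={info['eventlog']}")
      # ...construct the CPU cluster object from `cluster` here...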

tgravescs (Collaborator) commented

Note: some new cluster node recommendations are being added in #1160, so this should wait for that and use those node recommendations.

amahussein (Collaborator) commented

This is completed, as #1160 is merged and cost savings are turned off.
