Add internal CLI to generate instance descriptions for CSPs #1137

Merged
merged 16 commits into from
Jul 15, 2024

Conversation

cindyyuanjiang
Collaborator

@cindyyuanjiang cindyyuanjiang commented Jun 25, 2024

Fixes #1123.

This PR is the first step toward removing the dependency on CSP CLIs. Ideally, once this CLI is added, we can generate instance description JSON files for each platform and store them as resources in the tools repo. Then we can add logic to use these instance description files when running the tools.

Changes

Added an internal CLI spark_rapids_dev generate_instance_description [options].

Options:

  • platform: currently accepts emr, dataproc and databricks-azure. The databricks-aws platform can reuse the file generated for emr, and dataproc-gke can reuse the one generated for dataproc.
  • output_folder: directory path for the output file.

The generated json file has the following format (which is inspired by EMR CLI output):

{
  "instance_name": {
     "VCpuCount": xxx,
     "MemoryInMB": xxx,
     "GpuInfo": [
       {
         "Name": "xxx",
         "Count": xxx
       }
     ]
   },
  ...
}

For a CPU-only instance, the entry will not have a "GpuInfo" field.

Example json file entry for EMR platform
  "g5.4xlarge": {
    "VCpuCount": 16,
    "MemoryInMB": 65536,
    "GpuInfo":  [
      {
        "Name": "A10G",
        "Count": 1
      }
    ]
  }
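
As a rough illustration of how a consumer might read a file in this format, here is a minimal sketch. The function name `summarize_instances` and the sample data are hypothetical (the `m5.xlarge` entry is included only to show a CPU-only case) and are not part of this PR; the sketch assumes the key names shown above.

```python
import json

# Hypothetical sample mirroring the format described above; not real tool output.
SAMPLE = """
{
  "g5.4xlarge": {
    "VCpuCount": 16,
    "MemoryInMB": 65536,
    "GpuInfo": [{"Name": "A10G", "Count": 1}]
  },
  "m5.xlarge": {
    "VCpuCount": 4,
    "MemoryInMB": 16384
  }
}
"""


def summarize_instances(raw_json: str) -> dict:
    """Map each instance name to a (vcpus, memory_mb, gpu_names) tuple."""
    descriptions = json.loads(raw_json)
    summary = {}
    for name, entry in descriptions.items():
        # CPU-only entries omit "GpuInfo", so fall back to an empty list.
        gpu_names = [gpu["Name"] for gpu in entry.get("GpuInfo", [])]
        summary[name] = (entry["VCpuCount"], entry["MemoryInMB"], gpu_names)
    return summary


if __name__ == "__main__":
    for instance, info in summarize_instances(SAMPLE).items():
        print(instance, info)
```

Using `entry.get("GpuInfo", [])` lets the same loop handle both GPU and CPU-only entries without a separate branch.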

Testing

spark_rapids_dev generate_instance_description --platform emr/dataproc/databricks-azure

@cindyyuanjiang cindyyuanjiang self-assigned this Jun 25, 2024
@cindyyuanjiang cindyyuanjiang added feature request New feature or request user_tools Scope the wrapper module running CSP, QualX, and reports (python) labels Jun 25, 2024
@cindyyuanjiang
Collaborator Author

I want to discuss: is it a good approach to add a new spark_rapids_dev CLI or should we keep it under spark_rapids?

@mattahrens
Collaborator

I would rather keep it under the spark_rapids CLI instead of adding a new one. Is there a way to add the command without exposing it to users since it should be internal?

@tgravescs
Collaborator

What would be the goal of keeping it under the same CLI an end user would use, without a useful info message for anyone, including devs, to see?

@amahussein
Collaborator

I would rather keep it under the spark_rapids CLI instead of adding a new one. Is there a way to add the command without exposing it to users since it should be internal?

@mattahrens
In our CLI, it is not possible to hide a cmd/argument.

@amahussein
Collaborator

I want to discuss: is it a good approach to add a new spark_rapids_dev CLI or should we keep it under spark_rapids?

Thanks @cindyyuanjiang !
I will take a look at the changes.

Collaborator

@amahussein amahussein left a comment

Thanks @cindyyuanjiang !
Overall, this looks like it is going in the right direction.
I may need to take a deeper look to see if we can use more implementations from the spark_rapids_tools package.

@mattahrens
Collaborator

I would rather keep it under the spark_rapids CLI instead of adding a new one. Is there a way to add the command without exposing it to users since it should be internal?

@mattahrens In our CLI, it is not possible to hide a cmd/argument.

OK, then we can have a dev CLI for separation.

@cindyyuanjiang cindyyuanjiang marked this pull request as ready for review June 26, 2024 20:37
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Collaborator

@parthosa parthosa left a comment

Thanks @cindyyuanjiang.

Should we use a class to encapsulate the instance description structure instead of using hardcoded dictionary keys? This could improve maintainability across multiple files.

{
    "MemoryInfo": {
      "SizeInMiB": 49152
    },
    "GpuInfo": {
      "GPUs": [
        {
          "Name": "NVIDIA",
          "Manufacturer": "NVIDIA",
          "Count": 1,
          "MemoryInfo": {}
        }
      ]
    },
    "VCpuInfo": {
      "DefaultVCpus": 12
    }
}
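
One way the suggestion above could look is sketched below. The class names `GpuDescription` and `InstanceDescription` are hypothetical and are not taken from the PR; the sketch assumes the flat output format from the PR description ("VCpuCount", "MemoryInMB", "GpuInfo") rather than the nested structure quoted in this comment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GpuDescription:
    """One GPU type attached to an instance (hypothetical helper class)."""
    name: str
    count: int


@dataclass
class InstanceDescription:
    """Typed wrapper around one instance entry (hypothetical helper class)."""
    vcpu_count: int
    memory_in_mb: int
    gpus: List[GpuDescription] = field(default_factory=list)

    @classmethod
    def from_dict(cls, entry: dict) -> "InstanceDescription":
        # The dictionary keys live in one place instead of being repeated across files.
        gpus = [GpuDescription(g["Name"], g["Count"]) for g in entry.get("GpuInfo", [])]
        return cls(entry["VCpuCount"], entry["MemoryInMB"], gpus)

    def to_dict(self) -> dict:
        result = {"VCpuCount": self.vcpu_count, "MemoryInMB": self.memory_in_mb}
        if self.gpus:
            # CPU-only instances omit "GpuInfo" entirely.
            result["GpuInfo"] = [{"Name": g.name, "Count": g.count} for g in self.gpus]
        return result
```

With such a wrapper, callers would access `instance.vcpu_count` instead of `entry["VCpuCount"]`, so a renamed key would only need to change in `from_dict` and `to_dict`.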

@cindyyuanjiang cindyyuanjiang dismissed stale reviews from amahussein and parthosa via 1373aa8 July 12, 2024 21:07
@cindyyuanjiang
Collaborator Author

Added GPU info for the n1-standard series for Dataproc. A new entry looks like:

"n1-standard-16": {
    "VCpuCount": 16,
    "MemoryInMB": 61440,
    "GpuInfo": [
      {
        "Name": "T4",
        "Count": [
          1,
          2,
          4
        ]
      },
      {
        "Name": "P4",
        "Count": [
          1,
          2,
          4
        ]
      },
      {
        "Name": "V100",
        "Count": [
          2,
          4,
          8
        ]
      },
      {
        "Name": "P100",
        "Count": [
          1,
          2,
          4
        ]
      }
    ]
  }

cc: @tgravescs @amahussein

Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Collaborator

@parthosa parthosa left a comment

Currently, n1-standard entries use an array type for GpuInfo.Count, whereas it is an int for the others. Do we want to keep this difference for any special handling of n1-standard?

Example,

"n1-standard-64": {
    "VCpuCount": 64,
    "MemoryInMB": 245760,
    "GpuInfo": [
      {
        "Name": "T4",
        "Count": [
          4
        ]
      },
      {
        "Name": "P4",
        "Count": [
          4
        ]
      }
    ]
  },

vs

"g2-standard-4": {
    "VCpuCount": 4,
    "MemoryInMB": 16384,
    "GpuInfo": [
      {
        "Name": "L4",
        "Count": 1
      }
    ]
  },

@cindyyuanjiang
Collaborator Author

Thanks @parthosa! I have updated the GPU Count field to a list for consistency across instances and platforms.
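
For illustration only, the int-versus-list inconsistency discussed above could be normalized along these lines. The helper names `normalize_gpu_count` and `normalize_entry` are hypothetical and this is not the code from the PR; it simply shows one way to make "Count" always a list.

```python
from typing import List, Union


def normalize_gpu_count(count: Union[int, List[int]]) -> List[int]:
    """Return the GPU count as a list, wrapping a bare int if needed."""
    if isinstance(count, int):
        return [count]
    return list(count)


def normalize_entry(entry: dict) -> dict:
    """Return a copy of an instance entry whose GpuInfo counts are all lists."""
    normalized = dict(entry)
    if "GpuInfo" in entry:
        normalized["GpuInfo"] = [
            {**gpu, "Count": normalize_gpu_count(gpu["Count"])}
            for gpu in entry["GpuInfo"]
        ]
    return normalized
```

Keeping a single type for "Count" means downstream code can always iterate over the supported counts without first checking whether the value is an int or a list.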

amahussein
amahussein previously approved these changes Jul 15, 2024
Collaborator

@amahussein amahussein left a comment

Thanks @cindyyuanjiang !
LGTM

parthosa
parthosa previously approved these changes Jul 15, 2024
Collaborator

@parthosa parthosa left a comment

Thanks @cindyyuanjiang. LGTM

"GpuInfo": [
{
"Name": gpu_name,
"Count": 000
Collaborator

I think this comment needs to be updated to show an array.

Collaborator Author

Thanks, updated.

Collaborator

@tgravescs tgravescs left a comment

Minor nit on the comment, otherwise looks great. Thanks Cindy!

parthosa
parthosa previously approved these changes Jul 15, 2024
tgravescs
tgravescs previously approved these changes Jul 15, 2024
Signed-off-by: cindyyuanjiang <[email protected]>
@cindyyuanjiang cindyyuanjiang dismissed stale reviews from tgravescs and parthosa via b8ecc9d July 15, 2024 18:41
@amahussein amahussein merged commit c580851 into NVIDIA:dev Jul 15, 2024
14 checks passed
@cindyyuanjiang cindyyuanjiang deleted the spark-rapids-tools-1123 branch July 15, 2024 19:03
Labels
feature request New feature or request user_tools Scope the wrapper module running CSP, QualX, and reports (python)
Development

Successfully merging this pull request may close these issues.

Add an internal CLI to generate instance type descriptions for CSPs
6 participants