Cannot create certain GCP GPU instances #1398

Open
awendel-presien opened this issue Jul 3, 2023 · 11 comments
Labels
bug Something isn't working cloud-gcp Google Cloud

Comments

@awendel-presien

Hi everyone,

I'm getting the following error when trying to create A100 or L4 based instances on GCP using cml runner launch (a2-highgpu and g2-standard types respectively):

***"level":"error","message":"terraform error: Error: Failed creating the machine: googleapi: Error 400: Instances with guest accelerators do not support live migration., badRequest"***

I have no problem creating V100 and T4 instances (both n1 types).

I have found this discussion, which suggests the maintenance policy needs to be set to TERMINATE. Am I on the right track, and if yes, is there a way to do that using cml runner launch?

Regards,
Alex.

@0x2b3bfa0
Member

Hello, @awendel-presien! It looks like the instance maintenance behavior is already being set to TERMINATE when creating GPU instances. 🤔

@awendel-presien
Author

Hi @0x2b3bfa0, thanks for having a look at this! Do you have any other ideas as to why it might return that error for the newer a2-highgpu and g2-standard instance types?

@awendel-presien
Author

awendel-presien commented Jul 4, 2023

@0x2b3bfa0, unfortunately this error persists for us. As a test, I tried creating a g2-standard-4 instance using Terraform and the Iterative Terraform provider (so not using CML), and that worked without issue.

So the problem only occurs when trying to start g2 or a2 instances using cml runner launch.

Any ideas?

@awendel-presien
Author

We managed to get this working by including the GPU type and number in the --cloud-type option, e.g. g2-standard-96+nvidia-l4*8 instead of g2-standard-96.

I think this is something that should at least be documented, because the GPU suffix is technically superfluous: g2-standard-96 instances only come with 8x Nvidia L4 GPUs. It's the same with a2 instances; for example, a2-highgpu-8g only comes with 8x A100 GPUs.

It's also not necessary to specify the number and type of GPUs when using the Terraform Provider Iterative directly - it works just fine when only providing the machine type. And cml runner launch does not require this for AWS instances; for example you can launch a g4dn.metal instance without specifying the type and number of GPUs.
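
The workaround described above can be sketched as a small shell helper that composes the `--cloud-type` value from its three parts. The function name `gcp_cloud_type` is hypothetical and only for illustration, and the final line prints the launch command as a dry run instead of executing it.

```shell
# Hypothetical helper: build a --cloud-type value of the form
# <machine-type>+<gpu-type>*<gpu-count>, as used in the workaround above.
gcp_cloud_type() {
    printf '%s+%s*%s\n' "$1" "$2" "$3"
}

cloud_type=$(gcp_cloud_type g2-standard-96 nvidia-l4 8)
echo "$cloud_type"

# Dry run: print the launch command instead of executing it.
echo cml runner launch --cloud=gcp "--cloud-type=$cloud_type"
```

When running the real command, keep the value quoted so that the shell does not try to expand the `*` as a glob pattern.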

@hopeai

hopeai commented Jul 18, 2023

Hi @awendel-presien,

Did you manage to run any a2-highgpu instance using cml runner launch? I am getting the same error when I set --cloud-type=a2-highgpu-1g.

@dacbd
Contributor

dacbd commented Jul 18, 2023

@hopeai, can you try it as a2-highgpu+nvidia-a100*1 or a2-highgpu+nvidia-tesla-a100*1? We'll see if we can address this in the near future. In the past you had to select GPUs explicitly, and the GCP machine types didn't come with preselected GPU options the way, for example, the AWS instance types do.

@hopeai

hopeai commented Jul 19, 2023

Thanks @dacbd, I was able to solve this problem by setting --cloud-type=a2-highgpu-1g+nvidia-tesla-a100*1. BTW, how do you deal with the resource availability problem? Is there a plan to address this in the near future?

error: terraform error: Error: Failed creating the machine: Operation error: compute.OperationErrorErrors{Code:"ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS", ErrorDetails:[]*compute.OperationErrorErrorsErrorDetails{(*compute.OperationErrorErrorsErrorDetails)(0xc000336870), (*compute.OperationErrorErrorsErrorDetails)(0xc000336960), (*compute.OperationErrorErrorsErrorDetails)(0xc000336cd0)}, Location:"", Message:"The zone 'projects/MY_PROJECT/zones/us-central1-a' does not have enough resources available to fulfill the request. '(resource type:compute)'.", ForceSendFields:[]string(nil), NullFields:[]string(nil)}
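
For scripts that wrap the launch command, one way to tell this capacity error apart from other failures is to match on the error code in the captured output. This is a minimal sketch: the function name `is_capacity_error` is hypothetical, and the sample log is a shortened copy of the message above.

```shell
# Hypothetical helper: succeed only if the text on stdin contains the
# GCP capacity error code shown in the log above.
is_capacity_error() {
    grep -q 'ZONE_RESOURCE_POOL_EXHAUSTED'
}

# Example with a shortened copy of the error message.
log='Failed creating the machine: Operation error: Code:"ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS"'
if printf '%s\n' "$log" | is_capacity_error; then
    echo "capacity error: try another zone"
else
    echo "other error: do not retry"
fi
```

Retrying in another zone only makes sense for this kind of failure; a quota or configuration error would most likely fail the same way in every zone.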

@dacbd
Contributor

dacbd commented Jul 19, 2023

My recommendation, if it's something that you encounter often, would be to try some kind of simple bash loop, something like this:

# Zones to try, in order (bash array elements are separated by spaces, not commas)
zones=("us-central1-a" "us-central1-b" "us-central1-c")
for zone in "${zones[@]}"; do
    cml runner launch ... \
        --cloud-region="$zone" \
        ...
    if [ $? -eq 0 ]; then
          echo "Deployed runner in $zone"
          break
    else
          echo "Runner in $zone failed, trying next zone"
    fi
done

(I haven't explicitly tested the above)


@hopeai we aren't doing much active development on CML for the moment, but if you want to add this feature yourself, I'd be happy to prioritize testing any pull requests you make, and releasing any new additions.

@hopeai

hopeai commented Jul 20, 2023

Thanks for the recommendation, @dacbd. At the moment I'm using a similar bash loop, but I'd like to know if this is something that will be addressed in cml runner launch. It could be a --cloud-region-list option, or --cloud-region could accept more than one region to try.
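
Until such an option exists, the fallback can be approximated with a reusable wrapper function. This is a rough sketch, not a CML feature: `launch_with_fallback` and `launch_runner` are hypothetical names, and the options inside `launch_runner` are placeholders to be replaced with your own.

```shell
# Hypothetical wrapper: try a launch command in each zone until one succeeds.
# Usage: launch_with_fallback <launcher> <zone>...
# <launcher> is any command or function that takes the zone as its only argument.
launch_with_fallback() {
    launcher=$1
    shift
    for zone in "$@"; do
        if "$launcher" "$zone"; then
            echo "Runner deployed in $zone"
            return 0
        fi
        echo "Launch in $zone failed, trying next zone" >&2
    done
    echo "All zones failed" >&2
    return 1
}

# Example launcher (assumption: replace these options with your own).
launch_runner() {
    cml runner launch \
        --cloud=gcp \
        --cloud-region="$1" \
        --cloud-type='a2-highgpu-1g+nvidia-tesla-a100*1'
}

# launch_with_fallback launch_runner us-central1-a us-central1-b us-central1-c
```

Passing the launcher as an argument keeps the zone loop separate from the cml options, so the loop can be exercised with a stub command before spending time on real launches.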

@Arslan-Mehmood1

Check the quotas of your GCP account and provision resources accordingly via cml runner.

@mehadi92

mehadi92 commented Sep 9, 2024

Any update regarding this?

6 participants