[GCP:GPU] Unable to Provision VMs with GPU Accelerators #1125

Closed
powerkimhub opened this issue Mar 22, 2024 · 5 comments
Labels
bug Something isn't working CloudDriver

Comments

@powerkimhub
Member

(from #1124)


[Status]

  • The current latest Spider version (v0.8.9) fails with the following error when deploying a GPU-VM (a VM with a GPU attached) on GCP.
    {"message":"googleapi: Error 400: Instances with guest accelerators do not support live migration., badRequest"}
    

[Cause]

  • By default, a VM is deployed with the Maintenance option, which deploys it so that live migration is possible.
  • GPU-VMs do not support live migration, so a GPU-VM must be deployed with the Maintenance Off setting.

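[Plan]

  • Live migration is not actively used at the moment, so a temporary code block was added that disables VM live migration,
  • and a version containing it (v0.8.10) was released quickly so that it can be used right away.

  • In the meantime,
  • the driver team is asked to provide a proper patch.

    • Temporary patch for reference: 841ec6c
    • Maintenance is currently turned off for every VM deployment, so the Off setting needs to be applied only to GPU-VMs and then tested.
  • Note: ideally StartVM() would detect internally that the request is for a GPU-VM. As an alternative, please also consider the following approach.
    • (1) Try to deploy the GPU-VM with the default Maintenance setting, as before.
    • (2) For a GPU-VM, the '~not support live migration' error occurs.
    • (3) Only when the error is '~not support live migration', retry the GPU-VM deployment with the Maintenance Off setting.
      • On any other error: return the error and stop.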


@seokho-son


[How to use GCP GPU-VMs]

  • Use CB-Spider v0.8.10 or later.
  • Choose one of the following VM specs.
    [image]

[Test environment and status]

  • Version: CB-Spider v0.8.10
  • Region: us-central1 / us-central1-a (Iowa)
  • Image: https://www.googleapis.com/compute/v1/projects/ubuntu-os-cloud/global/images/ubuntu-2204-jammy-v20240319
  • Spec: a2-highgpu-1g
  • Verify that the GPU is attached: log in to the VM and run the following
    lspci |grep NVIDIA
    00:04.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 40GB] (rev a1)
    
  • After that, the VM should be usable once the NVIDIA driver, SDK, and so on are installed.
@powerkimhub
Member Author

@seokho-son @yunkon-kim (cc: @hippo-an)


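[Summary]

  • Among the cases in the table below [Cases where live migration is not supported],

  • when a deployment request arrives for a VM type that Spider currently provides,

  • the patch now sets the Live Migration Off option before deploying.

    [Cases where live migration is not supported]

    | VM type                       | Deployable via CB-Spider | Related spec       |
    |-------------------------------|--------------------------|--------------------|
    | Confidential VM (non-AMD N2D) | X                        | -                  |
    | VM with an attached GPU       | ✔️                       | a2-highgpu-1g, ... |
    | Cloud TPU                     | X                        | -                  |
    | Preemptible VM                | X                        | -                  |
    | Spot VM                       | X                        | -                  |
    | Storage-optimized VM (Z3)     | ✔️                       | z3-highmem-88, ... |
  • ※ Currently, requests such as n1 + additional GPU configuration, Spot VMs, and the like cannot be deployed through Spider.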
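
One way to picture the spec-based decision is the sketch below. It is an assumption-laden illustration rather than the patched driver code: the prefix list only covers the two example specs in the table (`a2-`, `z3-`), `onHostMaintenance` is a hypothetical helper name, and `n1-standard-1` is used purely as an example of a spec that keeps the default. The values "MIGRATE" and "TERMINATE" are the Compute Engine host maintenance policy values.

```go
package main

import (
	"fmt"
	"strings"
)

// noLiveMigrationPrefixes lists machine-type prefixes taken from the table
// above (assumption: prefix matching is enough for this illustration).
var noLiveMigrationPrefixes = []string{"a2-", "z3-"}

// onHostMaintenance returns the scheduling value to request for a spec:
// "TERMINATE" disables live migration, "MIGRATE" is the default behavior.
func onHostMaintenance(specName string) string {
	for _, p := range noLiveMigrationPrefixes {
		if strings.HasPrefix(specName, p) {
			return "TERMINATE"
		}
	}
	return "MIGRATE"
}

func main() {
	for _, spec := range []string{"a2-highgpu-1g", "z3-highmem-88", "n1-standard-1"} {
		fmt.Printf("%s -> %s\n", spec, onHostMaintenance(spec))
	}
}
```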

@seokho-son
Member

@powerkimhub @hippo-an

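When a VM creation request is made with gcp+us-central1 (and similar regions) and

specId: a2-highgpu-1g

  • A100 high-end GPU

the VM often fails to be created, presumably because of insufficient GCP quota or a shortage of available resources.

In this case, CB-Spider does not return a proper error.
It only checks several times whether the VM has been created, stays pending, and then returns an "instance was not found" error.

The situation needs to be investigated and the error message improved.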

[CB-SPIDER].[ERROR]: 2024-06-17 04:46:05 VMManager.go:509, github.com/cloud-barista/cb-spider/api-runtime/common-runtime.StartVM() - VcpuLimitExceeded: You have requested more vCPU capacity than your current vCPU limit of 64 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.
        status code: 400, request id: 6ee05d94-f601-427a-8558-6c67a8fc04ac 
[CB-SPIDER].[ERROR]: 2024-06-17 05:20:19 VPCHandler.go:439, github.com/cloud-barista/cb-spider/cloud-control-manager/cloud-driver/drivers/gcp/resources.(*GCPVPCHandler).GetVPC() - googleapi: Error 404: The resource 'projects/sean-oh-prj/global/networks/ns01-ns01-systemdefault-gcp-us-east1-cpnsf4jq10mdal01ocgg' was not found, notFound 
[CB-SPIDER].[ERROR]: 2024-06-17 05:21:13 common.go:156, github.com/cloud-barista/cb-spider/cloud-control-manager/cloud-driver/common.GetKey() - GCP, c4240bec42480e764a4381c10c92e2ce: does not exist! 
[CB-SPIDER].[ERROR]: 2024-06-17 05:22:15 VMHandler.go:1299, github.com/cloud-barista/cb-spider/cloud-control-manager/cloud-driver/drivers/gcp/resources.(*GCPVMHandler).WaitForRun() - The VM status is not [Running], so waiting for 1 second before querying. 
[CB-SPIDER].[ERROR]: 2024-06-17 05:22:17 VMHandler.go:1299, github.com/cloud-barista/cb-spider/cloud-control-manager/cloud-driver/drivers/gcp/resources.(*GCPVMHandler).WaitForRun() - The VM status is not [Running], so waiting for 1 second before querying. 
[CB-SPIDER].[ERROR]: 2024-06-17 05:22:19 VMHandler.go:1299, github.com/cloud-barista/cb-spider/cloud-control-manager/cloud-driver/drivers/gcp/resources.(*GCPVMHandler).WaitForRun() - The VM status is not [Running], so waiting for 1 second before querying. 
[CB-SPIDER].[ERROR]: 2024-06-17 05:22:22 VMHandler.go:1299, github.com/cloud-barista/cb-spider/cloud-control-manager/cloud-driver/drivers/gcp/resources.(*GCPVMHandler).WaitForRun() - The VM status is not [Running], so waiting for 1 second before querying. 
[CB-SPIDER].[ERROR]: 2024-06-17 05:22:23 VMHandler.go:884, github.com/cloud-barista/cb-spider/cloud-control-manager/cloud-driver/drivers/gcp/resources.(*GCPVMHandler).GetVMStatus() - googleapi: Error 404: The resource 'projects/sean-oh-prj/zones/us-east1-b/instances/ns01-gcp-a100-g1-1-cpnsg0bq10mdal01ocj0' was not found, notFound 
[CB-SPIDER].[ERROR]: 2024-06-17 05:22:23 VMHandler.go:1288, github.com/cloud-barista/cb-spider/cloud-control-manager/cloud-driver/drivers/gcp/resources.(*GCPVMHandler).WaitForRun() - googleapi: Error 404: The resource 'projects/sean-oh-prj/zones/us-east1-b/instances/ns01-gcp-a100-g1-1-cpnsg0bq10mdal01ocj0' was not found, notFound 
[CB-SPIDER].[ERROR]: 2024-06-17 05:22:23 VMHandler.go:1299, github.com/cloud-barista/cb-spider/cloud-control-manager/cloud-driver/drivers/gcp/resources.(*GCPVMHandler).WaitForRun() - The VM status is not [Running], so waiting for 1 second before querying. 
[CB-SPIDER].[ERROR]: 2024-06-17 05:22:25 VMHandler.go:884, github.com/cloud-barista/cb-spider/cloud-control-manager/cloud-driver/drivers/gcp/resources.(*GCPVMHandler).GetVMStatus() - googleapi: Error 404: The resource 'projects/sean-oh-prj/zones/us-east1-b/instances/ns01-gcp-a100-g1-1-cpnsg0bq10mdal01ocj0' was not found, notFound 

@powerkimhub
Member Author

@seokho-son


[CB-SPIDER].[ERROR]: 2024-06-17 04:46:05 VMManager.go:509, github.com/cloud-barista/cb-spider/api-runtime/common-runtime.StartVM() - VcpuLimitExceeded: You have requested more vCPU capacity than your current vCPU limit of 64 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.
        status code: 400, request id: 6ee05d94-f601-427a-8558-6c67a8fc04ac 
[CB-SPIDER].[ERROR]: 2024-06-17 05:20:19 VPCHandler.go:439, github.com/cloud-barista/cb-spider/cloud-control-manager/cloud-driver/drivers/gcp/resources.(*GCPVPCHandler).GetVPC() - googleapi: Error 404: The resource 'projects/sean-oh-prj/global/networks/ns01-ns01-systemdefault-gcp-us-east1-cpnsf4jq10mdal01ocgg' was not found, notFound 
[CB-SPIDER].[ERROR]: 2024-06-17 05:21:13 common.go:156, github.com/cloud-barista/cb-spider/cloud-control-manager/cloud-driver/common.GetKey() - GCP, c4240bec42480e764a4381c10c92e2ce: does not exist! 

The error messages above come from other, unrelated errors.

[CB-SPIDER].[ERROR]: 2024-06-17 05:22:15 VMHandler.go:1299, github.com/cloud-barista/cb-spider/cloud-control-manager/cloud-driver/drivers/gcp/resources.(*GCPVMHandler).WaitForRun() - The VM status is not [Running], so waiting for 1 second before querying. 
[CB-SPIDER].[ERROR]: 2024-06-17 05:22:17 VMHandler.go:1299, github.com/cloud-barista/cb-spider/cloud-control-manager/cloud-driver/drivers/gcp/resources.(*GCPVMHandler).WaitForRun() - The VM status is not [Running], so waiting for 1 second before querying. 
[CB-SPIDER].[ERROR]: 2024-06-17 05:22:19 VMHandler.go:1299, github.com/cloud-barista/cb-spider/cloud-control-manager/cloud-driver/drivers/gcp/resources.(*GCPVMHandler).WaitForRun() - The VM status is not [Running], so waiting for 1 second before querying. 
[CB-SPIDER].[ERROR]: 2024-06-17 05:22:22 VMHandler.go:1299, github.com/cloud-barista/cb-spider/cloud-control-manager/cloud-driver/drivers/gcp/resources.(*GCPVMHandler).WaitForRun() - The VM status is not [Running], so waiting for 1 second before querying. 
[CB-SPIDER].[ERROR]: 2024-06-17 05:22:23 VMHandler.go:884, github.com/cloud-barista/cb-spider/cloud-control-manager/cloud-driver/drivers/gcp/resources.(*GCPVMHandler).GetVMStatus() - googleapi: Error 404: The resource 'projects/sean-oh-prj/zones/us-east1-b/instances/ns01-gcp-a100-g1-1-cpnsg0bq10mdal01ocj0' was not found, notFound 
[CB-SPIDER].[ERROR]: 2024-06-17 05:22:23 VMHandler.go:1288, github.com/cloud-barista/cb-spider/cloud-control-manager/cloud-driver/drivers/gcp/resources.(*GCPVMHandler).WaitForRun() - googleapi: Error 404: The resource 'projects/sean-oh-prj/zones/us-east1-b/instances/ns01-gcp-a100-g1-1-cpnsg0bq10mdal01ocj0' was not found, notFound 
[CB-SPIDER].[ERROR]: 2024-06-17 05:22:23 VMHandler.go:1299, github.com/cloud-barista/cb-spider/cloud-control-manager/cloud-driver/drivers/gcp/resources.(*GCPVMHandler).WaitForRun() - The VM status is not [Running], so waiting for 1 second before querying. 
[CB-SPIDER].[ERROR]: 2024-06-17 05:22:25 VMHandler.go:884, github.com/cloud-barista/cb-spider/cloud-control-manager/cloud-driver/drivers/gcp/resources.(*GCPVMHandler).GetVMStatus() - googleapi: Error 404: The resource 'projects/sean-oh-prj/zones/us-east1-b/instances/ns01-gcp-a100-g1-1-cpnsg0bq10mdal01ocj0' was not found, notFound 

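  • The error logs related to this issue involved the following three problems. Because of the upcoming event, the fixes were applied directly as described below.
  • @MZC-CSC @hippo-an @CliffSynn : please review whether the changes affect anything else.

(1) Error messages from WaitForRun()

  • These are normal log messages emitted during VM creation (they should be info messages rather than errors, but this has not been changed).
  • They no longer seem to be printed now that (3) below has been applied.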

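(2) Error messages from GetVMStatus()

  • Layers above the driver, such as Spider, sometimes query the VM status even before the VM has been created.
  • In that case the driver prints an error log for a status query on a VM that does not exist, which is expected behavior.
    => This produces too many error logs, and the upper layer receives the error and can log it when needed,
    so the driver now skips the error log in the not found case (applied only to GetVM and GetVMStatus of GCP and Mock).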

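(3) Error notification type not recognized on quota or resource shortage

  • Currently, the error returned by the GCP SDK API does not reveal that a failure has occurred,
  • so the driver kept polling for VM creation to complete and eventually failed with a timeout error.
    => Changed to poll the status of the Operation (the VM creation call) instead.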

@powerkimhub
Member Author

@seokho-son

  • For reference, here are the execution log and error message from the patched version.

[API error message] GCP-specific message

curl -sX POST http://localhost:1024/spider/vm \
    -H 'Content-Type: application/json' \
    -d '{
        "ConnectionName" : "gcp-iowa-config",
        "ReqInfo" : {
            "Name" : "vm-01",
            "ImageType" : "PublicImage",
            "ImageName" : "https://www.googleapis.com/compute/v1/projects/ubuntu-os-cloud/global/images/ubuntu-minimal-1804-bionic-v20191024",
            "VMSpecName" : "a2-highgpu-1g",
            "VPCName" : "vpc-01",
            "SubnetName" : "subnet-01",
            "SecurityGroupNames" : ["sg-01"],
            "RootDiskType" : "default",
            "DataDiskNames" : [],
            "KeyPairName" : "keypair-01",
            "VMUserId" : "Administrator",
           ... omitted ...
        }
    }'

   ==> {"message":"Operation errors: The zone 'projects/powerkimhub/zones/us-central1-a' does not have enough resources available to fulfill the request.  '(resource type:compute)'."}

[Log output]

[CB-SPIDER].[ERROR]: 2024-06-18 00:57:14 VMHandler.go:490, github.com/cloud-barista/cb-spider/cloud-control-manager/cloud-driver/drivers/gcp/resources.(*GCPVMHandler).StartVM() - Operation error: The zone 'projects/powerkimhub/zones/us-central1-a' does not have enough resources available to fulfill the request.  '(resource type:compute)'.
[CB-SPIDER].[ERROR]: 2024-06-18 00:57:14 VMManager.go:509, github.com/cloud-barista/cb-spider/api-runtime/common-runtime.StartVM() - Operation errors: The zone 'projects/powerkimhub/zones/us-central1-a' does not have enough resources available to fulfill the request.  '(resource type:compute)'.

@seokho-son
Member

@powerkimhub Thank you for the quick support!!!!!!!! :)
