
Support AWS Inferentia chips #3095

Closed · RobertLucian opened this issue Apr 25, 2020 · 13 comments

@RobertLucian

It appears that even though Inferentia instances have been added to cluster-autoscaler/cloudprovider/aws/ec2_instance_types.go#L655-L678, they still don't have support for the actual Inferentia chips.

This is what I'm getting when I try to scale:
1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"api-api-0-67b4b9665f-rw7zm", UID:"808d2f5b-4c66-47cc-af1b-d66144ce5959", APIVersion:"v1", ResourceVersion:"31577", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 Insufficient hugepages-2Mi, 2 Insufficient aws.amazon.com/infa, 1 max node group size reached.

For reference, here are the instructions to get the k8s setup for neuron (Inferentia chips):
https://github.com/aws/aws-neuron-sdk/blob/master/docs/neuron-container-tools/tutorial-k8s.md#steps
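
For context, a pod that triggers this scale-up requests the Neuron device and hugepages resources along these lines. This is a minimal sketch based on the resource names in the error above; the pod name and image are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: neuron-example                      # hypothetical name
spec:
  containers:
    - name: api
      image: my-inference-image:latest      # placeholder image
      resources:
        requests:
          cpu: 1
          memory: 1Gi
          hugepages-2Mi: 256Mi
          aws.amazon.com/infa: 1
        limits:
          memory: 1Gi
          hugepages-2Mi: 256Mi
          aws.amazon.com/infa: 1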

Is there anyone who can help us with a patch? Or at least provide some info on which parts of the codebase have to be modified to accommodate Inferentia chips?

@gjtempleton
Member

Could you tell us which version of k8s and the CA you're running?

Also, are there any Inf1 instances running at the point the CA produces this message? If so, could you provide the output (appropriately obfuscated) of describing one of those nodes?

@RobertLucian
Author

RobertLucian commented Apr 25, 2020

@gjtempleton Thanks for getting back to me. The version of k8s is 1.15 (the highest available on EKS) and that of the CA is 1.18.1.

No, no Inf1 instances are running at that point. The node(s) aren't up because the CA can't find any nodes matching the requested resources, so no node gets spun up. I can force one to scale up, though, and then copy-paste the output.

Here's an Inf instance described:

Name:               ip-192-168-28-191.ec2.internal
Roles:              <none>
Labels:             alpha.eksctl.io/cluster-name=cortex-dev-0
                    alpha.eksctl.io/instance-id=i-xxxxxxxxxx
                    alpha.eksctl.io/nodegroup-name=ng-cortex-worker-spot
                    aws.amazon.com/infa=true
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=inf1.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-1
                    failure-domain.beta.kubernetes.io/zone=us-east-1a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-192-168-28-191.ec2.internal
                    kubernetes.io/os=linux
                    lifecycle=Ec2Spot
                    workload=true
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Sat, 25 Apr 2020 22:12:46 +0000
Taints:             aws.amazon.com/infa=true:NoSchedule
                    workload=true:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  ip-192-168-28-191.ec2.internal
  AcquireTime:     <unset>
  RenewTime:       Sat, 25 Apr 2020 22:31:27 +0000
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Sat, 25 Apr 2020 22:31:22 +0000   Sat, 25 Apr 2020 22:12:46 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Sat, 25 Apr 2020 22:31:22 +0000   Sat, 25 Apr 2020 22:12:46 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Sat, 25 Apr 2020 22:31:22 +0000   Sat, 25 Apr 2020 22:12:46 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Sat, 25 Apr 2020 22:31:22 +0000   Sat, 25 Apr 2020 22:13:17 +0000   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   192.168.28.191
  ExternalIP:   54.211.105.103
  Hostname:     ip-192-168-28-191.ec2.internal
  InternalDNS:  ip-192-168-28-191.ec2.internal
  ExternalDNS:  ec2-xxx-yyy-zzz-ttt.compute-1.amazonaws.com
Capacity:
  attachable-volumes-aws-ebs:  39
  aws.amazon.com/infa:         1
  cpu:                         4
  ephemeral-storage:           52416492Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               256Mi
  memory:                      7864680Ki
  pods:                        38
Allocatable:
  attachable-volumes-aws-ebs:  39
  aws.amazon.com/infa:         1
  cpu:                         3700m
  ephemeral-storage:           48843279730
  hugepages-1Gi:               0
  hugepages-2Mi:               256Mi
  memory:                      6783336Ki
  pods:                        38
System Info:
  Machine ID:                 ec2fb1c151e2b48eaef60796186cc875
  System UUID:                EC2046FE-5D02-5780-B32D-3974F12DA97A
  Boot ID:                    6ef97634-6ef1-480a-a976-f267d7e09ff5
  Kernel Version:             4.14.173-137.229.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://19.3.6
  Kubelet Version:            v1.15.10-eks-bac369
  Kube-Proxy Version:         v1.15.10-eks-bac369
ProviderID:                   aws:///us-east-1a/i-xxxxxxxxxx
Non-terminated Pods:          (5 in total)
  Namespace                   Name                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                    ------------  ----------  ---------------  -------------  ---
  default                     api-api-0-6c88fb8f4-vhb2z               200m (5%)     0 (0%)      10Mi (0%)        0 (0%)         6m29s
  kube-system                 aws-node-592wk                          10m (0%)      0 (0%)      0 (0%)           0 (0%)         18m
  kube-system                 istio-cni-node-zmm97                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         18m
  kube-system                 kube-proxy-b25r8                        100m (2%)     0 (0%)      0 (0%)           0 (0%)         18m
  kube-system                 neuron-device-plugin-daemonset-rfn7c    0 (0%)        0 (0%)      0 (0%)           0 (0%)         16m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests   Limits
  --------                    --------   ------
  cpu                         310m (8%)  0 (0%)
  memory                      10Mi (0%)  0 (0%)
  ephemeral-storage           0 (0%)     0 (0%)
  attachable-volumes-aws-ebs  0          0
  aws.amazon.com/infa         1          1
Events:
  Type    Reason    Age   From                                        Message
  ----    ------    ----  ----                                        -------
  Normal  Starting  18m   kube-proxy, ip-192-168-28-191.ec2.internal  Starting kube-proxy.

Also, one thing to keep in mind is that the number of allocatable hugepages-2Mi depends on the number of Inferentia chips the Inf1 instance has - by default, 256Mi per chip. In this case, there's only one chip, so there's 1 x 256Mi of hugepages-2Mi.
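
To make the arithmetic concrete, the expected per-node Neuron capacity would look roughly like this (the chip counts per instance size are AWS-published figures, not something verified in this thread):

inf1.xlarge / inf1.2xlarge (1 chip):  aws.amazon.com/infa: 1,  hugepages-2Mi: 1 x 256Mi = 256Mi
inf1.6xlarge (4 chips):               aws.amazon.com/infa: 4,  hugepages-2Mi: 4 x 256Mi = 1024Mi
inf1.24xlarge (16 chips):             aws.amazon.com/infa: 16, hugepages-2Mi: 16 x 256Mi = 4096Mi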

@RobertLucian
Author

RobertLucian commented Apr 27, 2020

@gjtempleton I've done a bit more testing on this matter and here's what I've found:

  1. On the current versions of EKS (1.15) and CA (1.18.1), the auto-scaling mechanism works just fine with inf1.xlarge instances. Whether the cluster starts with a node or not, auto-scaling works. That's good. Mind you, when I deploy a pod, I request either non-Inf resources or Inf resources.
  2. With inf1.6xlarge instances, the auto-scaling mechanism doesn't work when there are no nodes to start with and I request Inf resources. But if I request non-Inf resources (CPU and/or memory), the cluster auto-scales and I get my node(s). Subsequent pods that require either non-Inf or Inf resources trigger a scale-up, so that's good. In short, it doesn't work when scaling from 0.

I suspect that when there is more than one Inf chip on a given instance, the CA doesn't know how to bring such a node up (and I get the error from my first message). Somehow it works when there's at least one node already running. In the same vein, I suspect the same is true for inf1.24xlarge, but I can't test that because my quota is too low. I haven't tested inf1.2xlarge, but I suspect it will work because it also has only a single Inf chip.

One important correction: my previous node description should have been of an inf1.6xlarge instance, not an inf1.xlarge. The reported error is only logged when I use inf1.6xlarge instances.

Maybe there's an incompatibility between CA (version 1.18.1) and EKS (version 1.15) because of the version mismatch. I know they're supposed to run on matching versions (CA 1.15.x with EKS 1.15), but I had to update to 1.18.1 to get support for Inf instances, and EKS can't be updated to a higher version because 1.15 is already the highest. If #2550 were patched onto CA 1.15.6 (to get a 1.15.7), then maybe this issue would go away. Not to mention the error I keep getting on 1.18.1 about CSI nodes not existing. Do you think that would solve anything? What's your take on this? What do you think needs to be done here?


Edit

I noticed that our cluster was originally running CA 1.15.5. Version 1.15.6 adds support for dynamically loading the AWS EC2 instance types (which makes the generated static file unnecessary) and updates the AWS SDK to v1.28.14, which adds support for Inf instances.

Now, I tried version 1.15.6, and in the cluster-autoscaler logs I'm getting these dynamically loaded EC2 instance types (notice the inf1 instances in there):

1 aws_util.go:68] fetching https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/us-east-1/index.json
aws_cloud_provider.go:374] Successfully load 316 EC2 Instance Types [i2.xlarge r5ad.12xlarge r5n.24xlarge r5a.12xlarge a1 m5dn.16xlarge r5dn.16xlarge a1.metal p3dn m5.8xlarge r5.2xlarge m5dn.24xlarge c1.xlarge z1d.xlarge m5d.xlarge m5dn.large r5dn.xlarge u-18tb1 m2.4xlarge m5d.24xlarge r5.large h1.16xlarge t3.2xlarge g3 i3en.24xlarge m4.4xlarge r5dn.large c5d.large z1d.3xlarge f1.16xlarge m3.2xlarge t2.xlarge r5.4xlarge c5n.xlarge c5.24xlarge c3.xlarge c5d c4.xlarge u-24tb1.metal i3en p2 m5n.xlarge h1.8xlarge t3a.xlarge m3.large m5d.4xlarge m5ad.24xlarge r5ad.large m5ad.2xlarge m5dn.4xlarge m5dn.xlarge m5a.large t2.nano c4.large m6g.large d2.xlarge r5d.8xlarge c3.8xlarge i3en.large c5d.18xlarge c5d.metal d2 c3.2xlarge t3.small t3a.small z1d.6xlarge m5.4xlarge i3.8xlarge g4dn.xlarge t2.micro a1.4xlarge t3a.medium r5.16xlarge a1.xlarge x1 c5 c3.large c1.medium c4.8xlarge r5dn.2xlarge c5.4xlarge inf1.24xlarge r4.xlarge i3en.metal g3.4xlarge cr1.8xlarge f1 c5n t3.xlarge i3en.6xlarge r5ad.24xlarge r3.2xlarge i2 i3en.2xlarge r5d.large c5n.9xlarge m5d.12xlarge r5n.xlarge u-12tb1 m5.metal t3a.2xlarge a1.large m5n.4xlarge h1.4xlarge g4dn.12xlarge d2.8xlarge r5.12xlarge m6g.4xlarge m5 t2.large z1d.large t2.2xlarge i3 m5dn.metal r4.2xlarge inf1.6xlarge r3.xlarge r5.metal m5d.16xlarge m5dn.12xlarge r5a.16xlarge m5a.xlarge g2.2xlarge r5d m5d m5d.large p2.xlarge r5d.12xlarge m5n.24xlarge r5a.4xlarge hs1.8xlarge u-12tb1.metal r5dn r5d.xlarge m5n.12xlarge r5d.4xlarge d2.4xlarge g2 m6g.xlarge m5n.large a1.2xlarge p3.16xlarge x1e.xlarge g3.16xlarge c5n.large r5d.2xlarge r5a.large c4.4xlarge r5d.metal m5d.2xlarge r4.4xlarge f1.2xlarge r5d.24xlarge inf1.2xlarge p3.2xlarge r5n.4xlarge r4.16xlarge u-18tb1.metal x1e.4xlarge x1.16xlarge t3.nano i3.4xlarge p3.8xlarge i2.4xlarge m4.large i2.2xlarge inf1.xlarge m5a.4xlarge h1.2xlarge r5n.8xlarge f1.4xlarge r5d.16xlarge r5dn.metal r5 r5n.metal r5n.2xlarge g4dn.16xlarge c5.large m5n.metal r5n x1.32xlarge u-6tb1.metal m1.medium r5ad.xlarge r5n.large c3 x1e i3.16xlarge t3.medium g4dn m5n m5.large r5dn.4xlarge r5.24xlarge m5ad.xlarge g2.8xlarge z1d.2xlarge t3a.micro u-6tb1 g3s.xlarge p2.16xlarge r4.8xlarge m5ad.large m5.24xlarge p3 i3.xlarge r5.xlarge i3en.3xlarge m1.large m5dn.8xlarge c5n.metal c5d.12xlarge m4.2xlarge m5n.8xlarge m5.16xlarge i3en.xlarge m4.16xlarge m5n.2xlarge c4 r3.4xlarge u-9tb1.metal t2.medium m6g.8xlarge x1e.16xlarge g4dn.2xlarge c5.2xlarge c5n.2xlarge h1 c5.metal t2.small m3.xlarge c5d.24xlarge r5a.2xlarge c5.18xlarge m5d.metal z1d.metal t1.micro cc2.8xlarge t3a.nano z1d u-24tb1 c5d.xlarge c5n.18xlarge m4.xlarge m6g.medium c3.4xlarge r5dn.12xlarge p2.8xlarge a1.medium r5a.8xlarge m3 d2.2xlarge x1e.2xlarge r5.8xlarge g4dn.metal m5a.16xlarge r4 m4 c5d.9xlarge r5n.12xlarge c5.9xlarge m2.2xlarge t3.micro t3.large m4.10xlarge u-9tb1 m5n.16xlarge m5.2xlarge i3.metal i3.large m5.xlarge r3.large g3.8xlarge m6g.16xlarge g4dn.8xlarge r5dn.8xlarge r5ad.4xlarge m5d.8xlarge r5n.16xlarge t3a.large p3dn.24xlarge c4.2xlarge m5dn.2xlarge c5d.4xlarge c5d.2xlarge r5a.xlarge m6g.2xlarge r5ad.2xlarge m5a.24xlarge r3 m5a.2xlarge c5.12xlarge r4.large m5a.8xlarge m3.medium m6g.12xlarge m5dn c5.xlarge r5dn.24xlarge m1.small c5n.4xlarge i2.8xlarge i3en.12xlarge g4dn.4xlarge z1d.12xlarge m5ad.12xlarge i3.2xlarge m2.xlarge m5ad.4xlarge r3.8xlarge x1e.8xlarge m5.12xlarge x1e.32xlarge m5a.12xlarge m1.xlarge r5a.24xlarge]

But when it tries to auto-scale, I get this:
Unable to build proper template node for <...> uses the unknown EC2 instance type "inf1.6xlarge"
Now, this doesn't make sense, because inf1.6xlarge does appear in the dynamically loaded list of EC2 instance types above.

What do you think is going wrong here?

@gjtempleton
Member

I have an inkling about the 1.15.6 issue you've seen, but I need to double-check it in a cluster before I can say for sure.

In terms of the main issue though, can I ask what tags you have on the ASG when you're trying to scale from zero? Are there any tags on the ASG itself at the AWS level along the lines of

k8s.io/cluster-autoscaler/node-template/resources/aws.amazon.com/infa:         1
k8s.io/cluster-autoscaler/node-template/resources/hugepages-2Mi:               256Mi

?

@RobertLucian
Author

@gjtempleton I have something along the lines of

k8s.io/cluster-autoscaler/cortex-dev-1                            | owned                    | Yes
k8s.io/cluster-autoscaler/enabled                                 | true                     | Yes
k8s.io/cluster-autoscaler/node-template/label/aws.amazon.com/infa | true                     | Yes
k8s.io/cluster-autoscaler/node-template/label/workload            | true                     | Yes
k8s.io/cluster-autoscaler/node-template/taint/dedicated           | aws.amazon.com/infa=true

I see there was a bug regarding the dynamically loaded instance types, reported in #3109. Fixing it may make my issue go away, since I could then revert to 1.15.x. That's really great!

Is there a timeline for when this could make it into a 1.15.7 patch release? Or do you think it would be better to build my own version and use that instead? I'd need this fixed ASAP.

@RobertLucian
Author

@gjtempleton applying #3110 to version 1.16.5 and then adding the k8s.io/cluster-autoscaler/node-template/resources/ tags to the ASGs seems to have fixed my problem. I transitioned to EKS 1.16 anyway, which is why I'm no longer on 1.15.
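
For reference, with an eksctl-managed nodegroup those tags end up on the ASG roughly like this. This is a sketch rather than our exact config: the nodegroup name is reused from earlier in the thread, maxSize is illustrative, and the resource values assume an inf1.6xlarge with four chips (256Mi of hugepages-2Mi per chip):

nodeGroups:
  - name: ng-cortex-worker-spot
    instanceType: inf1.6xlarge
    minSize: 0
    maxSize: 5
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/node-template/resources/aws.amazon.com/infa: "4"
      k8s.io/cluster-autoscaler/node-template/resources/hugepages-2Mi: "1024Mi"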

I'm still testing, so I might come back with more feedback. Until then, I'd keep the ticket open.

@Jeffwan
Contributor

Jeffwan commented May 12, 2020

Inf1 has not been officially supported on EKS yet; it will be ready pretty soon (next week). Thanks for reporting issues here.

@allamand

Hi, I'm also having this problem:

E0616 13:21:18.348242       1 utils.go:321] Unable to build proper template node for NodeGroup-112HT3JOFIPYR: ASG "NodeGroup-112HT3JOFIPYR" uses the unknown EC2 instance type "inf1.2xlarge"

I'm using cluster-autoscaler:v1.15.6. Do you have a recommendation?

@Jeffwan
Contributor

Jeffwan commented Jul 6, 2020

/assign @Jeffwan

@Jeffwan
Contributor

Jeffwan commented Jul 7, 2020

@allamand

Can you use this option? It will load all instance types at runtime:

awsUseStaticInstanceList = flag.Bool("aws-use-static-instance-list", false, "Should CA fetch instance types in runtime or use a static list. AWS only")
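
For anyone hitting this, the flag is passed on the cluster-autoscaler container; here's a minimal sketch of the relevant Deployment args (the image tag and accompanying flags are illustrative, not a complete manifest):

    containers:
      - name: cluster-autoscaler
        image: k8s.gcr.io/cluster-autoscaler:v1.16.5    # illustrative version
        command:
          - ./cluster-autoscaler
          - --cloud-provider=aws
          - --aws-use-static-instance-list=false        # default: fetch the instance-type list at runtime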

@Jeffwan
Contributor

Jeffwan commented Jul 7, 2020

aws-use-static-instance-list is not enabled in some CA versions. We won't patch new instance types in every release

Check here for more details. Feel free to reopen the issue
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#use-static-instance-list

@k8s-ci-robot
Contributor

@Jeffwan: Closing this issue.

In response to this:

aws-use-static-instance-list is not enabled in some CA versions. We won't patch new instance types in every release

Check here for more details. Feel free to reopen the issue
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#use-static-instance-list

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Jeffwan
Contributor

Jeffwan commented Jul 7, 2020

/close
