
AWS: Update documentation #3198

Merged: 1 commit merged into kubernetes:master on Jun 29, 2020

Conversation

@otterley (Contributor) commented Jun 5, 2020

Update the AWS README document to:

  1. Improve clarity
  2. Fully document the use of tags on Auto Scaling Groups to advertise resources to Cluster Autoscaler
  3. Emphasize the use of IAM Roles for Service Accounts, where available, for security
  4. Fix some grammatical/spelling errors
  5. Wrap lines

@k8s-ci-robot added labels on Jun 5, 2020: cncf-cla: yes (the PR's author has signed the CNCF CLA); size/L (changes 100-499 lines, ignoring generated files).
@otterley (Contributor, Author) commented Jun 5, 2020

/assign @jaypipes

@arhea commented Jun 5, 2020

Is it worth mentioning how Cluster Autoscaler relates to Managed Node Groups (cluster autoscaler tags are pre-configured) and AWS Fargate (not needed)?

@otterley force-pushed the update-aws-docs branch 3 times, most recently from 6455873 to 3c046e7, on June 5, 2020 19:38.
@otterley (Contributor, Author) commented Jun 5, 2020

/assign @Jeffwan

@Jeffwan (Contributor) commented Jun 8, 2020

I was off last week; I will review it by end of day. Thanks for the contribution!

Review thread on cluster-autoscaler/cloudprovider/aws/README.md:

> **NOTE**: You can restrict the target resources for the autoscaling actions by
> specifying autoscaling group ARNS. More information can be found
> In addition, we also recommend adding `autoscaling:DescribeLaunchConfigurations`
> (if you created your ASG using a Launch Configuration) and/or

Contributor: This is confusing. Usually, the user just needs the LC (Launch Configuration) or the LT (Launch Template) permission.
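For readers following the thread, a minimal, illustrative sketch of attaching such a policy with the AWS CLI (the role and policy names here are hypothetical; keep only the launch-source permission that matches how your ASGs were created):

```sh
# Illustrative sketch only. Grants Cluster Autoscaler its core ASG permissions.
# Keep autoscaling:DescribeLaunchConfigurations for Launch Configurations,
# or ec2:DescribeLaunchTemplateVersions for Launch Templates; usually not both.
cat > ca-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeTags",
        "autoscaling:DescribeLaunchConfigurations",
        "ec2:DescribeLaunchTemplateVersions",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup"
      ],
      "Resource": "*"
    }
  ]
}
EOF

# "cluster-autoscaler-role" is a placeholder for the role the CA pods assume
# (ideally via IAM Roles for Service Accounts, as the PR description suggests).
aws iam put-role-policy \
  --role-name cluster-autoscaler-role \
  --policy-name ClusterAutoscaler \
  --policy-document file://ca-policy.json
```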

Review thread:

> @@ -71,153 +92,158 @@ env:
>         value: YOUR_AWS_REGION
> ```
>
> ## Deployment Specification
> Auto-Discovery Setup is always preferred option to avoid multiple, potentially different configuration for min/max values. If you want to adjust minimum and maximum size of the group, please adjust size on ASG directly, CA will fetch latest change when talking to ASG.
> ## Auto-Discovery Setup

Contributor: Do we need to make this change? Probably changing it to the following makes more sense? These should all belong under one header 2:

Deployment Specification

  • Auto-Discovery
  • Manual Configuration
    • One ASG
    • Multiple ASGs
  • Master Node Setup


Review thread:

> ## Scaling a node group to 0

Contributor: This is only used by the scale-from-0 case. Why did you delete the header?

Contributor Author: The previous language suggested that if there were no pods scheduled on an ASG, there might be a reason for the ASG size not to scale down to zero, and that it could somehow be corrected by tagging; or that scaling up from zero was an important special case. Having re-read the documentation and the source code a number of times, I'm not sure that scaling down to zero or up from zero is a special case. Am I mistaken?


Review thread:

> To run a cluster-autoscaler which auto-discovers ASGs with nodes use the `--node-group-auto-discovery` flag. For example, `--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<YOUR CLUSTER NAME>` will find the ASGs where those tag keys
> _exist_. It does not matter what value the tags have.
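For illustration, the tags described in the quoted passage could be applied to an existing ASG with the AWS CLI; a minimal sketch, where the ASG name `my-asg` and cluster name `my-cluster` are placeholders:

```sh
# Illustrative sketch: tag an ASG so auto-discovery can find it. Only the tag
# keys must exist; the values ("true"/"owned") are arbitrary. PropagateAtLaunch
# is irrelevant to discovery but is part of this CLI command's tag syntax.
aws autoscaling create-or-update-tags --tags \
  "ResourceId=my-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=false" \
  "ResourceId=my-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/my-cluster,Value=owned,PropagateAtLaunch=false"
```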
> Each Auto Scaling Group should be comprised of instance types that provide

Contributor: Even though Mixed Instances Policy is supported, this is not something we recommend users use. It can never guarantee the simulation is accurate, because the instances the ASG brings up may be unknown to CA. I don't recommend we talk about this here; it would be better to move it to the MixedInstancePolicy section. The best practice is still one ASG -> one Launch Template -> one instance type.
Contributor Author: I agree that using a single ASG with a single instance type is the simplest case for most customers, and we can come up with verbiage to emphasize that. However, we do recommend using Mixed Instances Policies under several circumstances: (1) customers using Spot Instances, and (2) very large customers who benefit from instance diversity to address node shortages in certain regions/AZs.

I disagree that a single ASG is "best practice," however, since we even support multiple ASGs with Managed Node Groups out of the box and encourage customers to use them where it makes sense for them.

Contributor: Mixed Instances Policy is not mature enough as a solution; the user has to take on the risk and know all the prerequisites to use it. It's also just one way of consuming Spot; there are groups of users using Spot separately without MIP. I may not have explained it clearly: my point is that a single instance type per ASG is the best practice, not a single ASG. You can definitely have more ASGs/node groups.
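For context on the feature being debated, a rough sketch of creating an ASG with a Mixed Instances Policy via the AWS CLI; all names, IDs, and sizes are placeholders, and per the thread the override types are deliberately same-size variants (m5/m5d):

```sh
# Illustrative sketch: a Spot-oriented ASG diversified across same-size
# instance types, the pattern discussed above. The launch template ID and
# subnet IDs are placeholders.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name my-spot-asg \
  --min-size 0 --max-size 10 --desired-capacity 1 \
  --vpc-zone-identifier "subnet-aaaa1111,subnet-bbbb2222" \
  --mixed-instances-policy '{
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateId": "lt-0123456789abcdef0",
        "Version": "$Latest"
      },
      "Overrides": [
        {"InstanceType": "m5.xlarge"},
        {"InstanceType": "m5d.xlarge"}
      ]
    },
    "InstancesDistribution": {
      "OnDemandPercentageAboveBaseCapacity": 0,
      "SpotAllocationStrategy": "lowest-price"
    }
  }'
```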

Review thread:

> ]
> }
> ```
> Cluster Autoscaler < 1.15: `cloud.google.com/gke-accelerator=<gpu-type>`

Contributor: Need to break the line here.

Review thread:

> - If you're running multiple ASGs, the `--expander` flag supports three options: `random`, `most-pods` and `least-waste`. `random` will expand a random ASG on scale up. `most-pods` will scale up the ASG that will schedule the most amount of pods. `least-waste` will expand the ASG that will waste the least amount of CPU/MEM resources. In the event of a tie, cluster autoscaler will fall back to `random`.
> - If you're managing your own kubelets, they need to be started with the `--provider-id` flag. The provider id has the format `aws:///<availability-zone>/<instance-id>`, e.g. `aws:///us-east-1a/i-01234abcdef`.
> - If you want to use regional STS endpoints (e.g. when using VPC endpoint for STS) the env `AWS_STS_REGIONAL_ENDPOINTS=regional` should be set.
> * The `/etc/ssl/certs/ca-bundle.crt` should exist by default on ec2 instance in

Contributor: It's kind of hard for me to check this line by line. Can you summarize your changes? Basically, our principle is to minimize the changes.

Contributor Author: It might be easier if you review the full README rather than reviewing it as a line-by-line patch, because I've completely reorganized the document for clarity. https://github.com/kubernetes/autoscaler/blob/3c046e7233d0d1dcb9cb46fd57f37c5dd189344e/cluster-autoscaler/cloudprovider/aws/README.md
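As an aside on the regional STS gotcha quoted above, one hypothetical way to set that variable on a running deployment (the namespace and deployment name assume the common Cluster Autoscaler install):

```sh
# Illustrative: route AWS SDK STS calls to the regional endpoint, e.g. when
# STS is reached through a VPC endpoint.
kubectl -n kube-system set env deployment/cluster-autoscaler \
  AWS_STS_REGIONAL_ENDPOINTS=regional
```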

(Seven other review threads on cluster-autoscaler/cloudprovider/aws/README.md were marked outdated and resolved.)

Review thread:

> ### Gotchas
> `<gpu-type>` varies by instance type. On P2 instances, for example, the
> value is `nvidia-tesla-k80`.

Contributor: How does the user find out what the supported GPU types are? Do we have a link to that?

Contributor Author: That's a good question. I lack this knowledge, personally.

Contributor: This is reserved for future GPU optimization; the user doesn't have to use nvidia-tesla-k80 at this moment, and any identifier will work. Meanwhile, we plan to ask MNG to apply these labels by default.
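To make that concrete, a hypothetical ASG tag advertising an accelerator label through the cloud provider's node-template tag mechanism; the ASG name is a placeholder, and per the comment above the value need not be nvidia-tesla-k80:

```sh
# Illustrative sketch: advertise a node label on the ASG so CA can account for
# GPU capacity before any node exists. "my-gpu-asg" is a placeholder; the
# label key shown is the pre-1.15 one quoted in the thread.
aws autoscaling create-or-update-tags --tags \
  "ResourceId=my-gpu-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/label/cloud.google.com/gke-accelerator,Value=nvidia-tesla-k80,PropagateAtLaunch=true"
```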

@otterley (Contributor, Author) commented:

Resolves #2786

@ari-becker commented:

> Resolves #2786

Yeah, this PR would certainly do this. Thanks @otterley - looks fantastic to me.

Not that it matters so much for this PR, but I would draw your attention to the partial solution (partial because it doesn't cover every EC2 instance type, only the ones we use) that we wrote in Dhall and open-sourced for the problem of computing the correct list of instances that can be put into a Mixed Instances Policy. This was essential to allow downstream to specify a desired instance type without having to provide all of the potential alternatives as well.

https://github.com/coralogix/dhall-aws/blob/55db0fb33d95e8a5e99b943bda1f74d23c388bba/ec2/InstanceType.dhall#L301

It surprisingly turned out to be one of the more profitable uses of our time - we prefer to use a lowest-cost allocation policy, and we would've never considered adding instance families like m5dn as their on-demand prices are higher than corresponding m5 instances. Apparently, most of AWS's users suffer the same fallacy, and we found AWS launching m5dn spot instances which, surprisingly, were sometimes cheaper than m5 spot instances. So, I applaud explicitly pointing out that instances like m5d can be added alongside m5.

@Jeffwan (Contributor) commented Jun 22, 2020

/lgtm

@jaypipes any other feedback or concerns?

@k8s-ci-robot added the lgtm ("Looks good to me") label on Jun 22, 2020.
@k8s-ci-robot added the needs-rebase (merge conflicts with HEAD) and size/XXL (changes 1000+ lines) labels and removed the lgtm and size/L labels on Jun 29, 2020.
@Jeffwan (Contributor) commented Jun 29, 2020

@otterley Can you check why you have code changes in this PR? There are lots of dependency conflicts. I assume this should only be a doc change.

@k8s-ci-robot added the size/L label and removed the needs-rebase and size/XXL labels on Jun 29, 2020.
@Jeffwan (Contributor) commented Jun 29, 2020

@otterley One more thing: could you squash the 3 commits into 1? This repo doesn't use squash-based merging, and one commit per change keeps the commit history clean. Thanks!

@otterley (Contributor, Author) commented:

@Jeffwan Done!

@jaypipes (Contributor) left a comment:

/lgtm
/approve

@otterley I really like these changes as a whole, thank you very much!

@k8s-ci-robot added the lgtm label on Jun 29, 2020.
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: jaypipes

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Approvers can indicate their approval by writing /approve in a comment; they can cancel approval by writing /approve cancel in a comment.
@k8s-ci-robot added the approved label on Jun 29, 2020.
@k8s-ci-robot merged commit 323509a into kubernetes:master on Jun 29, 2020.