AWS: Update documentation #3198
Conversation
/assign @jaypipes
Is it worth mentioning how Cluster Autoscaler relates to Managed Node Groups (cluster autoscaler tags are pre-configured) and AWS Fargate (not needed)?
6455873 to 3c046e7 (Compare)
/assign @Jeffwan
I was off last week and I will review it by end of day. Thanks for the contribution!
**NOTE**: You can restrict the target resources for the autoscaling actions by
specifying autoscaling group ARNs. More information can be found
In addition, we also recommend adding `autoscaling:DescribeLaunchConfigurations`
(if you created your ASG using a Launch Configuration) and/or
This is confusing. Usually, the user only needs the LC or LT permission.
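For readers following this thread, a minimal sketch of the permissions under discussion, with the write actions restricted to a specific ASG ARN as the NOTE above suggests (rendered in YAML for brevity, while the README's own policy examples are JSON; the region, account ID, and ASG name are placeholders):

```yaml
# Sketch of an IAM policy document for Cluster Autoscaler (placeholder account/ASG values).
Version: "2012-10-17"
Statement:
  - Effect: Allow
    Action:
      - autoscaling:DescribeAutoScalingGroups
      - autoscaling:DescribeAutoScalingInstances
      - autoscaling:DescribeTags
      - autoscaling:DescribeLaunchConfigurations   # only if the ASG was created from a Launch Configuration
      - ec2:DescribeLaunchTemplateVersions         # only if the ASG was created from a Launch Template
    Resource: "*"
  - Effect: Allow
    Action:
      - autoscaling:SetDesiredCapacity
      - autoscaling:TerminateInstanceInAutoScalingGroup
    # Write actions can be scoped to specific autoscaling group ARNs.
    Resource: "arn:aws:autoscaling:us-east-1:123456789012:autoScalingGroup:*:autoScalingGroupName/my-asg"
```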
@@ -71,153 +92,158 @@ env:
value: YOUR_AWS_REGION
```

## Deployment Specification
Auto-Discovery Setup is always the preferred option, since it avoids multiple, potentially different configurations for min/max values. If you want to adjust the minimum and maximum size of a group, please adjust the size on the ASG directly; CA will fetch the latest change when talking to the ASG.
## Auto-Discovery Setup
Do we need to make this change? Probably changing it to the following makes more sense? They should all belong under one level-2 header:
Deployment Specification
- Auto-Discovery
- Manual Configuration
  - One ASG
  - Multiple ASG
- Master Node Setup
## Scaling a node group to 0
This is only used by the scale-from-0 case. Why did you delete the header?
The previous language suggested that if there were no pods scheduled on an ASG, there might be a reason for the ASG size not to scale down to zero, and that it could somehow be corrected by tagging; or that scaling up from zero was an important special case. Having re-read the documentation and the source code a number of times, I'm not sure that scaling down to zero or up from zero is a special case. Am I mistaken?
To run a cluster-autoscaler which auto-discovers ASGs with nodes, use the `--node-group-auto-discovery` flag. For example, `--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<YOUR CLUSTER NAME>` will find the ASGs where those tag keys
_exist_. It does not matter what value the tags have.
Each Auto Scaling Group should be comprised of instance types that provide
Even though a mixed instance policy is supported, it is not something we recommend users to use. It never guarantees that all the simulation is accurate, because the instances the ASG brings up may be unknown to CA. I don't recommend we talk about this here; it would be better to move it to the MixedInstancePolicy section. The best practice is still one ASG -> one Launch Template -> one instance type.
I agree that using a single ASG with a single instance type is the simplest case for most customers, and we can come up with verbiage to emphasize that. However, we do recommend using Mixed Instance policies under several circumstances: (1) customers using Spot Instances, and (2) very large customers who benefit from instance diversity to address node shortages in certain regions/AZs.
I disagree that a single ASG is "best practice", however, since we even support multiple ASGs with Managed Node Groups out of the box and encourage customers to use them where it makes sense for them.
Mixed Instance Policy is not mature enough as a solution, and the user has to take on the risk and know all the prerequisites to use it. It's just one way of using Spot; there are groups of users using Spot separately without MIP. I may not have explained it clearly: my point is that a single instance type per ASG is the best practice, not a single ASG. You can definitely have more ASGs/node groups.
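To make the auto-discovery setup quoted above concrete, here is a minimal sketch of the container args and ASG tags involved (the cluster name is a placeholder, and the tag values shown are arbitrary, as the quoted text notes):

```yaml
# Sketch: Cluster Autoscaler container args for ASG auto-discovery.
command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<YOUR CLUSTER NAME>
# The ASGs to be discovered carry both tag keys; the tag values do not matter, e.g.:
#   k8s.io/cluster-autoscaler/enabled             = "true"
#   k8s.io/cluster-autoscaler/<YOUR CLUSTER NAME> = "owned"
```

Per the discussion above, the simplest configuration to reason about keeps a single instance type (or a single Launch Template) per discovered ASG.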
]
}
```
Cluster Autoscaler < 1.15: `cloud.google.com/gke-accelerator=<gpu-type>`
A line break is needed here.
- If you're running multiple ASGs, the `--expander` flag supports three options: `random`, `most-pods` and `least-waste`. `random` will expand a random ASG on scale up. `most-pods` will scale up the ASG that will schedule the most pods. `least-waste` will expand the ASG that will waste the least amount of CPU/MEM resources. In the event of a tie, cluster autoscaler will fall back to `random`.
- If you're managing your own kubelets, they need to be started with the `--provider-id` flag. The provider id has the format `aws:///<availability-zone>/<instance-id>`, e.g. `aws:///us-east-1a/i-01234abcdef`.
- If you want to use regional STS endpoints (e.g. when using a VPC endpoint for STS), the env `AWS_STS_REGIONAL_ENDPOINTS=regional` should be set.
* The `/etc/ssl/certs/ca-bundle.crt` should exist by default on an ec2 instance in
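A sketch of how the flags and environment variable from the bullets above might appear together in the container spec of the cluster-autoscaler Deployment (the values chosen are illustrative, not recommendations):

```yaml
# Sketch: flags/env from the notes above, shown in a container spec with illustrative values.
containers:
  - name: cluster-autoscaler
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --expander=least-waste          # alternatives: random, most-pods
    env:
      - name: AWS_STS_REGIONAL_ENDPOINTS
        value: regional                 # use regional STS endpoints, e.g. with a VPC endpoint for STS
# Self-managed kubelets would separately be started with
#   --provider-id=aws:///<availability-zone>/<instance-id>  (e.g. aws:///us-east-1a/i-01234abcdef)
```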
It's kind of hard for me to check line by line. Can you summarize your changes?
Basically, our principle is to minimize the changes.
It might be easier if you review the full README rather than reviewing it as a line-by-line patch, because I've completely reorganized the document for clarity. https://github.com/kubernetes/autoscaler/blob/3c046e7233d0d1dcb9cb46fd57f37c5dd189344e/cluster-autoscaler/cloudprovider/aws/README.md
### Gotchas
`<gpu-type>` varies by instance type. On P2 instances, for example, the
value is `nvidia-tesla-k80`.
How does the user find out what the supported GPU types are? Do we have a link to that?
That's a good question. I lack this knowledge, personally.
This is reserved for future GPU optimization; the user doesn't have to use `nvidia-tesla-k80`
at this moment, any identifier will work. Meanwhile, we plan to ask MNG to apply these labels by default.
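For context, a sketch of how such a label can be surfaced on an empty node group through ASG tags, so that CA can build a node template when scaling from zero (the `node-template/label` tag prefix is the AWS provider's mechanism for declaring labels of nodes that don't exist yet; the value shown is arbitrary, per the comment above):

```yaml
# Sketch: ASG tag describing a label of nodes that have not been launched yet (scale-from-zero).
Tags:
  - Key: k8s.io/cluster-autoscaler/node-template/label/cloud.google.com/gke-accelerator
    Value: nvidia-tesla-k80   # CA < 1.15 label key; any identifier works at the moment
```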
Resolves #2786
Yeah, this PR would certainly do this. Thanks @otterley - looks fantastic to me. Not that it matters so much for this PR, but I would draw your attention to the partial solution (partial because it doesn't cover every EC2 instance type, only the ones we use) we wrote in Dhall and open-sourced for the problem of trying to compute the correct list of instances that can be put into a Mixed Instance Policy. This was essential to allow downstream to specify a desired image type, without having to provide all of the potential alternatives as well. It surprisingly turned out to be one of the more profitable uses of our time - we prefer to use a lowest-cost allocation policy, and we would've never considered adding instance families like
/lgtm
@jaypipes any other feedback or concerns?
@otterley Can you check why you have code changes in this PR? There are lots of dependency conflicts. I assume this is only a doc change.
c0a254d to 8d87ebd (Compare)
@otterley One more thing: could you squash the 3 commits into 1 commit? This repo doesn't use squash-based merges. One commit for one change keeps the commit history clean. Thanks!
8d87ebd to 87dfbf8 (Compare)
@Jeffwan Done!
/lgtm
/approve
@otterley I really like these changes as a whole, thank you very much!
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jaypipes
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Update AWS README document for: