data/aws: Bump default instance type from t2.medium to t3.medium #724
Conversation
I don't think we want to move towards more reliance on defaults in Terraform. Also, this causes the master machine definitions to drift from the machines that actually exist.

/hold
Should the master machine definitions be changed as well? We must assume our default node types will change over time (even after the initial 4.0 release).
This same default feeds the master definitions.
let's set some requirements:
3-15 nodes is a good small-medium cluster, 100 namespaces. e2e pushes harder than that on API traffic, but runs on 3 nodes densely.
Force-pushed 7d236d0 -> 3bdf83b
With 7d236d0 -> 3bdf83b, I've dropped some more Terraform defaults to make sure we aren't relying on them. I've also dropped the […]. We'll see how 3bdf83b plays in CI, since we're not sure the initial CPU credits are sufficient to get us through the e2e tests without paying too much for the extended burst.
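As a rough sketch of what dropping a Terraform default means here (the variable name below is illustrative, not the actual one removed in 3bdf83b):

```hcl
# Illustrative only: previously this variable carried something like
# `default = "t2.medium"`. With the default removed, Terraform fails
# fast unless the installer-generated tfvars explicitly provide a
# value, so nothing silently relies on the module's fallback.
variable "aws_worker_ec2_type" {
  type = string
}
```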
I'm patching to get that passed through...
Actually, we probably don't need […].
I've filed #742 for this.
The last consumers for these were removed by 124ac35 (*: Use machine-api-operator to deploy worker nodes, 2018-09-04, openshift#119).
And the related, libvirt-specific, tectonic_libvirt_worker_ips. This simplifies the Terraform logic for AWS and OpenStack, and consistently pushes most worker setup into the cluster-API providers, which have been creating the workers since 124ac35 (*: Use machine-api-operator to deploy worker nodes, 2018-09-04, openshift#119).
/hold cancel
/approve
Force-pushed 3bdf83b -> 5156348
I've pushed 3bdf83b -> 5156348, rebasing onto #742 and adding a fairly rough commit to fix the asset graph feeding […].
Force-pushed 9c9a657 -> 2b76d62
Force-pushed 2b76d62 -> 0dc7ae8
Force-pushed 85e86cf -> 12716e8
Force-pushed 12716e8 -> 30521b4
I've pushed 85e86cf -> 8ed4d9f, dropping the […].
The T3 instances were launched on 2018-08-21 [1]. The documented stats are fairly similar to the T2 analogs [2,3]:

| | t2.medium | t3.medium |
| --- | --- | --- |
| vCPUs | 2 | 2 |
| Memory (GiB) | 4.0 | 4.0 |
| Baseline Performance/vCPU | | 20% |
| CPU Credits earned/hr | 24 | 24 |
| Network burst bandwidth (Gbps) | | 5 |
| EBS burst bandwidth (Gbps) | | 1.50 |
| On-Demand Price/hr* | $0.0464 | $0.0418 |
| 1-yr Reserved Instance Effective Hourly* | | $0.025 |
| 3-yr Reserved Instance Effective Hourly* | | $0.017 |

\* Prices shown are for Linux/Unix in US East (Northern Virginia) AWS Region. Prices for 1-year and 3-year reserved instances are for "Partial Upfront" payment options or "No upfront" for instances without the Partial Upfront option. For full pricing details, see the Amazon EC2 pricing page [4].

The only obvious difference there is that the T3 instances are 10% cheaper. However, from [1]:

> If the instance runs at higher CPU utilization for a prolonged period, there will be an additional charge of $0.05 per vCPU-hour.

and from [2]:

> T3 instances start in Unlimited mode by default... If the instance needs to run at higher CPU utilization for a prolonged period, it can do so for an additional charge of $0.05 per vCPU-hour. In Standard mode, a T3 instance can burst until it uses up all its earned credits.

The difference vs. T2 seems to be that with T2, Unlimited mode was opt-in, while with T3 it is opt-out. You can set a credit specification in Terraform [5] to opt out, but there doesn't seem to be a setting in the cluster-API structures for this. And in any case, we expect:

* Most users to be able to run clusters of up to ~15 workers without overextending an Unlimited t3.medium.
* To have cluster alerts for when nodes have exhausted their CPU quota, which will tell cluster admins that it's time to rotate in some larger master nodes. We may already have this alerting, but I'm not sure where it lives.
* To have CI be able to get through its e2e tests before exhausting the initial CPU credits. But we can monitor this with the above-mentioned alerting as well.

The FIXME is because pkg/tfvars is consuming the install-config, so it doesn't know about configuration that landed further downstream in the asset graph's pkg/asset/machines. I have a partial fix for that, but Abhinav wanted to punt that to future work [6].

[1]: https://aws.amazon.com/blogs/aws/new-t3-instances-burstable-cost-effective-performance/
[2]: https://aws.amazon.com/ec2/instance-types/t2/
[3]: https://aws.amazon.com/ec2/instance-types/t3/
[4]: https://aws.amazon.com/ec2/pricing/
[5]: https://www.terraform.io/docs/providers/aws/r/instance.html#credit-specification
[6]: openshift#724 (comment)
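For illustration, opting out of Unlimited mode from Terraform looks roughly like the snippet below. This is a minimal sketch against a standalone aws_instance; the resource and variable names are placeholders, not the installer's actual modules.

```hcl
variable "worker_ami" {
  type = string # placeholder; any valid AMI ID for the target region
}

# Sketch: pin a burstable instance to Standard credits so it cannot
# accrue the $0.05/vCPU-hour surplus charge that Unlimited mode allows.
resource "aws_instance" "example_worker" {
  ami           = var.worker_ami
  instance_type = "t3.medium"

  credit_specification {
    cpu_credits = "standard" # T3 instances default to "unlimited"
  }
}
```

Per the discussion above, this knob only exists for instances Terraform itself creates; workers created through the cluster-API provider have no equivalent setting.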
Force-pushed 30521b4 -> 8ed4d9f
@wking: The following test failed, say /retest to rerun it:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/test e2e-aws
IAMRoleName:      m.Platform.AWS.IAMRoleName,
WorkerRootVolume: aws.WorkerRootVolume{
Is this why workers are now 16gb volumes?
This is just passing data to Terraform, and worker creation is via the cluster-API now. So I don't see a connection.
/lgtm

To unblock all the queues.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, smarterclayton, wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
45m of this run was teardown, and it looks like a big chunk of that was Route 53. We have a ton of garbage in the API; it probably needs to be cleaned.
I cleaned it ;)
Hrm, this is going to be a problem with […]:

$ oc --namespace=openshift-cluster-api logs clusterapi-manager-controllers-559fc8d878-d5p65 --container=machine-controller
E1130 05:38:29.574505 1 actuator.go:96] error creating machine: error launching instance: error creating EC2 instance: InstanceLimitExceeded: You have requested more instances (21) than your current instance limit of 20 allows for the specified instance type. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.
status code: 400, request id: 01da30c5-c310-48c0-9550-79cc78cccd7e

(from this cluster). Can we get some serialization so we can limp through until we get that limit raised? Or switch to some other instance type where we have more power than […]?
#765 is in flight to get us on […].
The T3 instances were launched on 2018-08-21. The documented stats are fairly similar to the T2 analogs. The only obvious difference there is that the T3 instances are 10% cheaper. However, from here:

> If the instance runs at higher CPU utilization for a prolonged period, there will be an additional charge of $0.05 per vCPU-hour.

and from here:

> T3 instances start in Unlimited mode by default... If the instance needs to run at higher CPU utilization for a prolonged period, it can do so for an additional charge of $0.05 per vCPU-hour. In Standard mode, a T3 instance can burst until it uses up all its earned credits.

The difference vs. T2 seems to be that with T2, Unlimited mode was opt-in, while with T3 it is opt-out. I've set a credit specification in Terraform to opt us out, but there doesn't seem to be a setting in the cluster-API structures for this.